HOW TO OPTIMIZE FOR THE PENTIUM PROCESSOR
*****************************************
Copyright (c) 1996 by Agner Fog

Contents:

1.  introduction
2.  literature
3.  debugging and verifying
4.  memory model
5.  alignment
6.  cache
7.  address generation interlock
8.  pairing integer instructions
9.  first time vs. repeated execution
10. imperfect pairing
11. splitting complex instructions into simpler ones
12. jumps and branches
13. prefixes
14. reducing code size
15. scheduling floating point code
16. loops
17. discussion of special instructions
18. using integer instructions to do floating point operations
19. using floating point instructions to do integer operations
20. list of integer instructions
21. list of floating point instructions
22. testing code speed
23. considerations for other microprocessors


1. INTRODUCTION
===============
This manual describes in detail how to write optimized assembly language
code, with particular focus on the Pentium(R) microprocessor.

The manual is more detailed than what you will find elsewhere, enabling you 
in many cases to calculate exactly how many clock cycles a piece of code 
will take.  

Programming in assembly language is much more difficult than programming in 
a high level language. Bugs are easy to make and hard to find. Now 
you have been warned! It is assumed that the reader is already experienced 
in assembly programming. If not, then please read some books on the subject 
and get some programming experience before you begin doing complicated 
optimizations.  

The hardware design of the Pentium chip has many features which are 
optimized specifically for some commonly used instructions or instruction 
combinations, rather than using general optimization methods. Consequently, 
the rules for optimizing software for this design are quite complicated and 
have many exceptions, but the possible gain in performance may be 
substantial.  

Before you start to convert your code to assembly, make sure that your 
algorithm is optimal. Often you can improve a piece of code much more by 
improving the algorithm than by converting it to assembler code.  

Some high level language compilers offer reasonably good optimization for the 
Pentium, and further optimization by hand may not be worth the effort except 
for the most critical pieces of code - or for the sport of it.  

Intel has recently announced that they will produce new versions of the 
Pentium and PentiumPro processors with 57 new instruction opcodes for doing
integer vector operations. These are called multimedia extension (MMX)
instructions.  The Pentium chip with MMX will behave slightly differently from
the Pentium without MMX, according to the technical documents. These
differences will be mentioned where appropriate. The Pentium Pro processor
behaves very differently from the Pentium, and will be covered only briefly
by this manual.

The information in this document is based mainly on my own experiments on
a Pentium computer. Information about the PentiumPro and MMX processors is
based solely on documents from Intel. Karki J. Bahadur from Univ. Colorado
is thanked for useful contributions.

Please don't send your programming questions to me. I am not going to do your
homework for you.

Good luck with your hunt for nanoseconds!


2. LITERATURE
=============
A lot of useful literature can be downloaded for free from Intel's www site:
www.intel.com

You can find the documents you need by using the search facilities at:
http://www-cs.intel.com/search.htm
and:
http://www-techdoc.intel.com/cgi-bin/taos.pl

The documents are in various file formats. If a particular document is in a
format not supported by your word processing software, then you may find an
appropriate file viewer somewhere on the internet. Many
software companies are offering such file viewers for free to support their
file formats.

The most useful document is Intel's application note:
'AP-526 Optimizations for Intel's 32-bit Processors'
which can be downloaded from the following URL:
http://www-techdoc.intel.com/cgi-bin/synopsis_spangle.pl?\docs\microproc\general\app_note\242816+0

A fancy tutorial named "Optimizing Applications for the Pentium Processor"
is available at:
http://www.intel.com/ial/processr/cbt.htm

Detailed information on the MMX processors can be found in the documents:
"Intel architecture MMX technology developer's manual", and
"Intel architecture MMX technology programmer's reference manual"
both of which are available from:
http://www.intel.com/pc-supp/multimed/MMX

Many sources other than Intel also have useful information. These sources
are listed in the comp.lang.asm.x86 FAQ. I would particularly recommend
http://www.x86.org.
The shareware editor ASMEDIT has an online help which covers all instruction
codes etc. It is available from:
http://www.inf.tu-dresden.de/~ok3/asmedit.html


3. DEBUGGING AND VERIFYING
==========================
Debugging assembly code can be quite hard and frustrating, as you have 
probably already discovered. I would recommend that you start by writing the 
piece of code you want to optimize as a subroutine in a high level language.  
Next, write a test program that will test your subroutine thoroughly. Make 
sure the test program goes into all branches and special cases.

When your high level language subroutine works with your test program, then 
you are ready to translate the code to assembly language (some high level 
compilers will do this for you), and test it on the test program.

Then you can start to optimize. Each time you have made a modification you 
should run it on the test program to see if it works correctly.

Number all your versions and save them so that you can go back and test them
again in case you discover an error which the test program didn't catch 
(such as writing to a wrong address).


4. MEMORY MODEL
===============
The Pentium is designed primarily for 32 bit code, and its performance is 
inferior on 16 bit code. Segmenting your code and data also degrades
performance significantly, so you should generally prefer 32 bit flat mode, 
and an operating system which supports this mode (e.g. Windows 95, OS/2, or
a 32 bit DOS extender). The code examples shown in this manual assume a 32
bit flat memory model, unless otherwise specified.


5. ALIGNMENT
============
All data in RAM should be aligned to addresses divisible by 2, 4, or 8
according to this scheme:

operand size      alignment
---------------------------
1  (byte)          1
2  (word)          2    (or address MOD 4 <> 3; other processors require
                         alignment by 2)
4  (dword)         4
6  (fword)         4    (Pentium Pro requires alignment by 8)
8  (qword)         8
10 (tbyte)         8    (Pentium Pro requires alignment by 16)

Misaligned data will take at least 3 clock cycles extra to access.

Aligning code is not necessary when you run on the Pentium, but for optimal
performance on other processors you may align subroutine entries and loop 
entries by 8 or 16.


6. CACHE
========
The Pentium has 8 kb of on-chip cache (level one cache) for code, and 8 kb
for data. Data in the level 1 cache can be read or written to in only one
clock cycle, whereas a cache miss may cost many clock cycles. It is 
therefore important that you understand how the cache works in order to use 
it most effectively. Most descriptions of the cache in other documents are 
insufficient and difficult to understand. I hope you find this one better.

The data cache consists of 256 lines of 32 bytes each. Each time you read a 
data item which is not cached, the processor will read an entire cache line 
from memory. The cache lines are always aligned to a physical address 
divisible by 32. When you have read a byte at an address aligned by 32, then 
the next 31 bytes can be read or written to at almost no extra cost. You can 
take advantage of this by arranging data which are used near each other into 
aligned blocks of 32 bytes of memory. If, for example, you have a loop which 
accesses two arrays, then you may combine the two arrays into one array of 
structures, so that data which are used together are also stored together.

If the size of an array or other data structure is a multiple of 32 bytes, 
then you should preferably align it by 32.

A cache line cannot be assigned to an arbitrary memory address. Each cache 
line has a 7-bit set-value which must match bits 5 through 11 of the physical
RAM address (bits 0-4 select the byte within a cache line). There are two
cache lines for each of the 128 set-values, so there are two possible cache
lines to assign to any RAM address. The consequence of this is that the
cache can hold no more than two different data blocks which have the same 
value in bits 5-11 of the address. You can determine if two addresses have
the same set-value by the following method: Strip off the lower 5 bits of
each address to get a value divisible by 32. If the difference between the 
two truncated addresses is a multiple of 4096 (=1000H), then the addresses
have the same set-value.

Let me illustrate this by the following piece of code, where ESI holds an 
address divisible by 32:

AGAIN:  MOV  EAX, [ESI]
        MOV  EBX, [ESI + 13*4096 +  4]
        MOV  ECX, [ESI + 20*4096 + 28]
        DEC  EDX
        JNZ  AGAIN

The three addresses used here all have the same set-value because the 
differences between the truncated addresses are multiples of 4096. This loop
will perform very poorly. At the time you read ECX there is no free cache
line with the proper set-value so the processor takes the least recently 
used of the two possible cache lines, that is the one that was used for EAX, 
and fills it with the data from [ESI + 20*4096] to [ESI + 20*4096 + 31] and
then reads ECX.  Next, when reading EAX, you find that the cache line that
held the value for EAX has been discarded, so you take the least recently 
used line, which is the one holding the EBX value, and so on.  You have 
nothing but cache misses and the loop takes something like 60 clock cycles.
If the third line is changed to:

        MOV  ECX, [ESI + 20*4096 + 32]

then we have crossed a 32 byte boundary, so that we do not have the same 
set-value as in the first two lines, and there will be no problem assigning 
a cache line to each of the three addresses. The loop now takes only 3 clock 
cycles (except for the first time) - a very considerable improvement!

It may be very difficult to determine if your data addresses have the same 
set-values, especially if they are scattered around in different segments.
The best thing you can do to avoid problems of this kind is to keep all data 
used in the critical part of your program within one contiguous block of 
max. 8 kbytes or two contiguous blocks of max. 4 kbytes each (for example
one block for static data and another block for data on the stack). This
will make sure that your cache lines are used optimally.

If the critical part of your code accesses big data structures or random 
data addresses, then you may want to keep all frequently used variables
(counters, pointers, control variables, etc.) within a single contiguous 
block of max 4 kbytes so that you have a complete set of cache lines free 
for accessing random data. Since you probably need stack space anyway for 
subroutine parameters and return addresses, the best thing is to copy all 
frequently used static data to dynamic variables on the stack, and copy them 
back again outside the main loop if they have been changed.

Reading a data item which is not in the level one cache causes an entire 
cache line to be filled from the level two cache, which takes approximately 
200 ns (that is 20 clocks on a 100 MHz system), but the bytes you ask for
first are already available after 50-100 ns. If the data item is not in the
level two cache either, then you will get a delay of something like 200-300 
ns. This delay will be somewhat longer if you cross a DRAM page boundary.
(The size of a DRAM page is 1 kb for 4 MB 72 pin RAM modules, and 2 kb for
16 MB modules).

When you write to an address which is not in the level 1 cache, then the 
value will go right through to the level 2 cache or the RAM (depending on
how the level 2 cache is set up). This takes approximately 100 ns. If you
write eight or more times to the same 32 byte block of memory without also
reading from it, and the block is not in the level one cache, then it may be 
advantageous to make a dummy read from the block first to load it into a 
cache line. All subsequent writes to the same block will then go to the 
cache instead, which takes only one clock cycle. (This is not needed on a 
PentiumPro, which always loads a cache line on write misses). There is 
sometimes a small penalty for writing repeatedly to the same address without
reading in between.

Another way to reduce the number of write misses is to write 8 bytes at a
time using  FILD / FISTP  with qword operands rather than writing 4 bytes at
a time using integer registers. The extra cost of the slow FILD and FISTP
instructions is compensated for by the fact that you only have to do half as
many writes. However, this method is only advantageous on a Pentium, and
only if the destination is not in the cache. The method is explained in
section 19 below.

Temporary data may conveniently be stored on the stack because stack data
are very likely to be in the cache. You should be aware, however, that if
you store QWORD data on a DWORD size stack, or DWORD data on a WORD size
stack, then you have a potential alignment problem.

If the life ranges of two data structures do not overlap, then they may use
the same RAM area to increase cache efficiency. This is consistent with the
common practice of allocating space for temporary variables on the stack.

Storing temporary data in registers is of course even more efficient. Since
registers are a scarce resource, you may want to use [ESP] rather than [EBP]
for addressing data on the stack, in order to free EBP for other purposes.
Just don't forget that the value of ESP changes every time you do a PUSH or 
POP. (You cannot use ESP under 16-bit Windows because the timer interrupt 
will modify the high word of ESP at unpredictable places in your code.)

The Pentium has a separate 8 kb cache for code, which is similar to the data
cache. It is important that the critical part of your code (the innermost
loops) fit in the code cache. Frequently used pieces of code or routines 
which are used together should preferably be stored near each other. Seldom
used branches or procedures should go to the bottom of your code.


7. ADDRESS GENERATION INTERLOCK (AGI)
=====================================
It takes one clock cycle to calculate the address needed by an instruction
which accesses memory. Normally, this calculation is done at a separate
stage in the pipeline while the preceding instruction or instruction pair is
executing. But if the address depends on the result of an instruction
executing in the preceding clock cycle, then you have to wait an extra clock
cycle for the address to be calculated. This is called an AGI stall.

Example:
ADD EBX,4 / MOV EAX,[EBX]    ; AGI stall
The stall in this example can be removed by putting some other instructions
in between  ADD EBX,4  and  MOV EAX,[EBX]  or by rewriting the code to:
MOV EAX,[EBX+4] / ADD EBX,4

You can also get an AGI stall with instructions which use ESP implicitly for
addressing, such as PUSH, POP, CALL, and RET, if ESP has been changed in the 
preceding clock cycle by instructions such as MOV, ADD, or SUB. The Pentium
has special circuitry to predict the value of ESP after a stack operation so 
that you do not get an AGI delay after changing ESP with PUSH, POP, or CALL.
You can get an AGI stall after RET only if it has an immediate operand to
add to ESP.

Examples:
ADD ESP,4 / POP ESI            ; AGI stall
POP EAX   / POP ESI            ; no stall, pair
MOV ESP,EBP / RET              ; AGI stall
CALL L1 / L1: MOV EAX,[ESP+8]  ; no stall
RET / POP EAX                  ; no stall
RET 8 / POP EAX                ; AGI stall

The LEA instruction is also subject to an AGI stall if it uses a base or
index register which has been changed in the preceding clock cycle. Example:
INC ESI / LEA EAX,[EBX+4*ESI]  ; AGI stall


8. PAIRING INTEGER INSTRUCTIONS
===============================
The Pentium has two pipelines for executing instructions, called the U-pipe 
and the V-pipe. Under certain conditions it is possible to execute two 
instructions simultaneously, one in the U-pipe and one in the V-pipe. This 
can almost double the speed. It is therefore advantageous to reorder your 
instructions to make them pair.

The following instructions are pairable in either pipe:
  MOV register, memory, or immediate into register or memory
  PUSH register or immediate
  POP register
  LEA, NOP
  INC, DEC, ADD, SUB, CMP, AND, OR, XOR, 
  and some forms of TEST (see section 17.2)

The following instructions are pairable in the U-pipe only:
ADC, SBB
SHR, SAR, SHL, SAL with immediate count
ROR, ROL, RCR, RCL with an immediate count of 1

The following instructions can execute in either pipe but are only pairable 
when in the V-pipe: near call, short and near jump, short and near 
conditional jump.

All other integer instructions can execute in the U-pipe only, and are not 
pairable.

Two consecutive instructions will pair when the following conditions are met:

8.1 The first instruction is pairable in the U-pipe and the second instruction
    is pairable in the V-pipe.

8.2 The second instruction cannot read or write a register which the first
    instruction writes to. Examples:
    MOV EAX, EBX / MOV ECX, EAX     ; read after write, do not pair
    MOV EAX, 1   / MOV EAX, 2       ; write after write, do not pair
    MOV EBX, EAX / MOV EAX, 2       ; write after read, pair OK
    MOV EBX, EAX / MOV ECX, EAX     ; read after read, pair OK
    MOV EBX, EAX / INC EAX          ; read and write after read, pair OK

8.3 In rule 8.2 partial registers are treated as full registers. Example:
    MOV AL, BL  /  MOV AH, 0        ; writes to different parts of the same
                                    ; register, do not pair

8.4 Two instructions which both write to parts of the flags register can pair
    despite rule 8.2 and 8.3.  Example:
    SHR EAX,4 / INC EBX   ; pair OK

8.5 An instruction which writes to the flags can pair with a conditional jump
    despite rule 8.2.  Example:
    CMP EAX, 2 / JA LabelBigger   ; pair OK

8.6 The following instruction combinations can pair despite the fact that they
    both modify the stack pointer:
    PUSH + PUSH,  PUSH + CALL,  POP + POP

8.7 There are restrictions on the pairing of instructions with prefix.
    There are several types of prefixes:
    - instructions addressing a non-default segment have a segment prefix.
    - instructions using 16 bit data in 32 bit mode, or 32 bit data in 16 bit
      mode have an operand size prefix.
    - instructions using 32 bit base or index registers in 16 bit mode have
      an address size prefix.
    - repeated string instructions have a repeat prefix.
    - locked instructions have a LOCK prefix.
    - many instructions which were not implemented on the 8086 processor
      have a two byte opcode where the first byte is 0FH. The 0FH byte
      behaves as a prefix on the Pentium without MMX, but not on the Pentium 
      with MMX. The most common instructions with 0FH prefix are: MOVZX, 
      MOVSX, PUSH FS, POP FS, PUSH GS, POP GS, LFS, LGS, LSS, SETcc, BT, 
      BTC, BTR, BTS, BSF, BSR, SHLD, SHRD, and IMUL with two operands and no
      immediate operand.

    On the Pentium without MMX, a prefixed instruction can only execute in
    the U-pipe, except for conditional near jumps.
    On the Pentium with MMX, instructions with operand size, address 
    size, or 0FH prefix can execute in either pipe, whereas instructions with 
    segment, repeat, or lock prefix can only execute in the U-pipe.

8.8 An instruction which has both a displacement and immediate data is not
    pairable on the Pentium without MMX and only pairable in the U-pipe on
    Pentium with MMX:
    MOV DWORD PTR DS:[1000], 0    ; not pairable or only in U-pipe
    CMP BYTE PTR [EBX+8], 1       ; not pairable or only in U-pipe
    CMP BYTE PTR [EBX], 1         ; pairable
    CMP BYTE PTR [EBX+8], AL      ; pairable

8.9 Both instructions must be preloaded and decoded. This is explained in
    section 9.


9. FIRST TIME VERSUS REPEATED EXECUTION
=======================================
A piece of code usually takes much more time the first time it is executed
than when it is repeated. The reasons are the following:

9.1 Loading the code from RAM into the cache takes longer than
    executing it.

9.2 In some cases decoding the code is the bottleneck. If it takes one clock
    cycle to determine the length of an instruction, then it is not possible 
    to decode two instructions per clock cycle, because the processor 
    doesn't know where the second instruction begins. The Pentium solves 
    this problem by remembering the length of any instruction which has 
    remained in the cache since last time it was executed. As a consequence
    of this, a set of instructions will not pair the first time they are 
    executed, unless the first of the two instructions is only one byte 
    long.

9.3 Branches will not be in the branch target buffer the first time they
    execute, and therefore are less likely to be predicted correctly.

9.4 Any data accessed by the code has to be loaded into the cache, which may
    take much more time than executing the instructions. When the code is 
    repeated, then the data are more likely to be in the cache.  

For these four reasons, a piece of code inside a loop will generally take 
much more time the first time it executes than the subsequent times.

If you have a big loop which doesn't fit in the code cache then you will get 
the penalty of 9.1 and 9.2 all the time because it doesn't run from the
cache. You should therefore try to reorganize the loop to make it fit into 
the cache.  

If you have more than 256 jumps and branches inside a loop, then you will 
get the penalty in 9.3 all the time.

Likewise, if a loop repeatedly accesses a data structure too big for
the data cache, then you will get the penalty of data cache misses all the
time.


10. IMPERFECT PAIRING
=====================
There are situations where the two instructions in a pair will not execute
simultaneously, or only partially overlap in time. They should still be
considered a pair, because the first instruction executes in the U-pipe, and
the second in the V-pipe. No subsequent instruction can start to execute
before both instructions in the imperfect pair have finished.

Imperfect pairing will happen in the following cases:

10.1 If the second instruction suffers an AGI stall (see chapter 7).

10.2 Two instructions cannot access the same dword of memory simultaneously.
     The following examples assume that ESI is divisible by 4:
     MOV AL, [ESI] / MOV BL, [ESI+1]
     The two operands are within the same dword, so they cannot execute
     simultaneously. The pair takes 2 clock cycles.
     MOV AL, [ESI+3] / MOV BL, [ESI+4]
     Here the two operands are on each side of a dword boundary, so they pair
     perfectly, and take only one clock cycle.

10.3 Rule 10.2 is extended to the case where bits 2-4 are the same in the two
     addresses (cache bank conflict). For dword addresses this means that
     the difference between the two addresses should not be divisible by 32.
     Examples:
     MOV [ESI], EAX / MOV [ESI+32000], EBX ;  imperfect pairing
     MOV [ESI], EAX / MOV [ESI+32004], EBX ;  perfect pairing

Pairable instructions which do not access memory take one clock cycle,
except mispredicted jumps. MOV instructions to or from memory also take only
one clock cycle if the data area is in the cache and properly aligned. There
is no penalty for using complex addressing modes such as scaled index
registers.

Pairable instructions which read from memory, do some calculation, and store
the result in a register or the flags, take 2 clock cycles (read/modify
instructions).

Pairable instructions which read from memory, do some calculation, and write
the result back to memory, take 3 clock cycles
(read/modify/write instructions).

10.4 If a read/modify/write instruction is paired with a read/modify or
     read/modify/write instruction, then they will pair imperfectly.

The number of clock cycles used is given in this table:

                      |             First instruction
                      | MOV or             read/       read/modify/
Second instruction    | register only      modify      write
----------------------|----------------------------------------------
MOV or register only  |      1               2              3
read/modify           |      2               2              4
read/modify/write     |      3               3              5
----------------------|-----------------------------------------------

Examples:
ADD [mem1], EAX / ADD EBX, [mem2]  ; 4 clock cycles
ADD EBX, [mem2] / ADD [mem1], EAX  ; 3 clock cycles

10.5 When two paired instructions both take extra time due to cache misses,
     misalignment, or jump misprediction, then the pair will take more time
     than each instruction, but less than the sum of the two.

In order to avoid imperfect pairing you have to know which instructions go
into the U-pipe and which into the V-pipe. You can find this out by looking
backwards in your code and searching for instructions which are unpairable,
pairable only in one of the pipes, or which cannot pair due to one of the
rules in chapter 8.

Imperfect pairing can often be avoided by reordering your instructions.
Examples:

L1:     MOV     EAX,[ESI]
        MOV     EBX,[ESI]
        INC     ECX

Here the two MOV instructions form an imperfect pair because they both
access the same memory location, and the sequence takes 3 clock cycles.
You can improve the code by reordering the instructions so that  INC ECX
pairs with one of the MOV instructions.

L2:     MOV     EAX,OFFSET [A]
        XOR     EBX,EBX
        INC     EBX
        MOV     ECX,[EAX]
        JMP     L1

The pair  INC EBX / MOV ECX,[EAX]  is imperfect because the latter
instruction has an AGI stall. The sequence takes 4 clocks. If you insert
a NOP or any other instruction so that  MOV ECX,[EAX]  pairs with  JMP L1
instead, then the sequence takes only 3 clocks.

The next example is 16 bit mode, assuming that SP is divisible by 4:

L3:     PUSH    AX
        PUSH    BX
        PUSH    CX
        PUSH    DX
        CALL    FUNC

Here the PUSH instructions form two imperfect pairs, because both operands
in each pair go into the same dword of memory. PUSH BX could possibly pair
perfectly with PUSH CX (because they go on each side of a dword boundary)
but it doesn't because it has already been paired with PUSH AX. The sequence
therefore takes 5 clocks. If you insert a NOP or any other instruction so
that PUSH BX pairs with PUSH CX, and PUSH DX with CALL FUNC, then the
sequence will take only 3 clocks. Another way to solve the problem is to
make sure that SP is not divisible by 4. Knowing whether SP is divisible by
4 or not in 16 bit mode can be difficult, so the best way to avoid this
problem is to use 32 bit mode.


11. SPLITTING COMPLEX INSTRUCTIONS INTO SIMPLER ONES
====================================================
You may split up read/modify and read/modify/write instructions to improve
pairing. Example:
ADD [mem1],EAX / ADD [mem2],EBX    ; 5 clock cycles
This code may be split up into a sequence which takes only 3 clock cycles:
MOV ECX,[mem1] / MOV EDX,[mem2] / ADD ECX,EAX / ADD EDX,EBX
MOV [mem1],ECX / MOV [mem2],EDX

Likewise you may split up non-pairable instructions into pairable instructions:
PUSH [mem1] / PUSH [mem2]  ; non-pairable
Split up into:
MOV EAX,[mem1] / MOV EBX,[mem2] / PUSH EAX / PUSH EBX  ; everything pairs

Other examples of non-pairable instructions which may be split up into simpler
pairable instructions:
CDQ  split into:  MOV EDX,EAX / SAR EDX,31
NOT EAX  change to  XOR EAX,-1
NEG EAX  split into  XOR EAX,-1 / INC EAX
MOVZX EAX,BYTE PTR [mem]  split into  XOR EAX,EAX / MOV AL,BYTE PTR [mem]
JECXZ  split into  TEST ECX,ECX / JZ 
LOOP   split into  DEC ECX / JNZ
XLAT   change to   MOV AL,[EBX+EAX]

If splitting instructions doesn't improve speed, then you may keep the 
complex or nonpairable instruction in order to reduce code size.  


12. JUMPS AND BRANCHES
======================
The Pentium attempts to predict whether a conditional jump is taken or not.
It uses a 'branch target buffer' (BTB), which can remember the history of 
256 jump instructions.

The Pentium without MMX makes the prediction on the basis of the last two 
events. It guesses that a conditional jump will be taken if it was taken
the previous time or the time before. It guesses that the jump is not taken 
if it was not taken the last two times. A conditional jump which has not 
been seen before (or is not in the BTB) is predicted as not taken.

The Pentium with MMX (and the PentiumPro) makes its prediction on the basis 
of the last four events, so that it is able to predict a simple repetitive
pattern. A conditional jump which has not been seen before (or is not in the 
BTB) is predicted as taken if it goes backwards (e.g. a loop), and as not 
taken if it goes forward.

If a conditional jump is properly predicted (i.e. if the guess was correct) 
then it takes only one clock cycle. A mispredicted branch takes 4 clock 
cycles if it is in the U-pipe and 5 clock cycles if it is in the V-pipe. 

The penalty for misprediction is much higher if another jump or call follows 
in the first clock cycle after the mispredicted branch. The Pentium behaves 
very strangely in this situation. The branch prediction mechanism gets 
totally messed up: The second branch may be mispredicted when it should be 
predicted, or vice versa. And the problem hangs over to the next time so
that the first branch is likely to be mispredicted next time it is executed, 
even when it should be predicted. Even unconditional jumps, calls and
returns will suffer a misprediction penalty when they follow in the first
clock cycle after a mispredicted branch. To avoid this you should avoid any
branch, jump, call, or return immediately after a poorly predictable branch
instruction or its target label. You may put in NOP's or other instructions
to avoid this. Example:

        DEC     EAX
        JNZ     L1
        CMP     EBX,ECX
        NOP
        JB      L2
        ...
L1:     NOP
        NOP
        CALL    P
        ...
L2:     NOP
        RET

The Pentium with MMX may also have a penalty in this case, but only if the 
two consecutive branch instructions end within the same aligned dword of 
memory. This problem may be avoided by using a near displacement rather 
than short on the second branch instruction to make it longer, but this 
method doesn't help on the Pentium without MMX so you should rather put in 
some NOP's or other instructions to avoid problems on both processors.

The jump prediction algorithm is optimal for a loop where the testing is at 
the bottom, as in this example: 

        MOV ECX, [N]
L:      MOV [EDI],EAX
        ADD EDI,4
        DEC ECX
        JNZ L

Since the jump prediction algorithm for the Pentium without MMX is 
asymmetric, there may be situations where you can improve performance by 
reordering your code. Consider for example this if-then-else construction: 

        TEST EAX,EAX
        JNZ  SHORT A1
        CALL F0
        JMP  SHORT E
A1:     CALL F1
E:

If F0 is called more often than F1, and F1 is seldom called twice in 
succession, then you can improve jump prediction on the Pentium without MMX 
by swapping the two branches. However, this will be slightly suboptimal on 
the Pentium with MMX and the PentiumPro because they may mispredict the 
branch if it is not in the branch target buffer. Another tradeoff is that 
the code cache is used less efficiently when the seldom used branch comes 
first. You may put in two NOP's before each of the CALL instructions here to
avoid the penalty of a call after a mispredicted jump.  

Multiway branches (case statements) are best implemented with a list of jump 
addresses on the Pentium without MMX. The jump addresses should be placed in 
the data segment, not in the code segment. On the Pentium with MMX and the 
PentiumPro, indirect jumps and calls through function pointers must be 
predictable in order to be effective, so on these processors it may be 
better to use multiple 2-way branches.
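The jump-table approach can be illustrated in C (a hypothetical sketch of my own; the handler names and function-pointer table stand in for the case bodies, and a compiler typically emits exactly this kind of indirect jump for a dense switch):

```c
/* Hypothetical handlers standing in for the case bodies. */
static int case0(int x) { return x + 1; }
static int case1(int x) { return x * 2; }
static int case2(int x) { return -x; }

typedef int (*handler)(int);

/* The jump addresses live in a data table, not in the code segment. */
static const handler table[3] = { case0, case1, case2 };

int dispatch(int sel, int x)
{
    /* One indirect call replaces a chain of compare-and-branch. */
    return table[sel](x);
}
```

On processors where indirect jumps predict poorly, the same selector could instead be handled with a small tree of two-way branches, as the text suggests.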

All calls should be matched with returns in order for the returns to be 
predicted correctly on the Pentium with MMX and the PentiumPro.

Avoiding branches
-----------------
Sometimes it is possible to obtain the same effect as a branch by ingenious 
manipulation of bits and flags. For example, you may calculate the absolute 
value of a signed number without branching:
MOV EDX,EAX
SAR EDX,31
XOR EAX,EDX
SUB EAX,EDX
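The same branch-free absolute value can be sketched in C (assuming, as virtually all compilers guarantee, that right-shifting a signed integer is arithmetic, matching the SAR instruction):

```c
#include <stdint.h>

/* Branch-free absolute value: mask is 0 for x >= 0, -1 for x < 0. */
int32_t branchless_abs(int32_t x)
{
    int32_t mask = x >> 31;      /* SAR EDX,31: copy of the sign bit  */
    return (x ^ mask) - mask;    /* XOR then SUB negates if negative  */
}
```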

The carry flag is particularly useful for this kind of trick.
Setting carry if a value is zero:  CMP [value],1
Setting carry if a value is not zero:  XOR EAX,EAX  /  CMP EAX,[value]
Incrementing a counter if carry:  ADC EAX,0
Setting a bit for each time the carry is set:  RCL EAX,1
Generating a bit mask if carry is set:  SBB EAX,EAX

This example finds the minimum of two unsigned numbers:  if (b < a)  a = b;
SUB EBX,EAX
SBB ECX,ECX
AND ECX,EBX
ADD EAX,ECX
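In C the same unsigned minimum can be sketched like this (the `-(b < a)` term plays the role of the borrow mask that SBB ECX,ECX produces; the function name is mine):

```c
#include <stdint.h>

/* Branch-free unsigned minimum, mirroring the SUB/SBB/AND/ADD sequence. */
uint32_t branchless_min(uint32_t a, uint32_t b)
{
    uint32_t diff = b - a;               /* SUB EBX,EAX                 */
    uint32_t mask = -(uint32_t)(b < a);  /* SBB ECX,ECX: all ones if    */
                                         /* the subtraction borrowed    */
    return a + (diff & mask);            /* AND / ADD: a, or a+(b-a)=b  */
}
```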

This example chooses between two numbers:  if (a != 0)  a = b;  else a = c;
CMP EAX,1
SBB EAX,EAX
AND ECX,EAX
XOR EAX,-1
AND EAX,EBX
OR  EAX,ECX
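A C sketch of the same selection (the mask is all ones exactly when a is zero, matching the CMP EAX,1 / SBB EAX,EAX pair; the function name is mine):

```c
#include <stdint.h>

/* Branch-free select: returns b if a != 0, otherwise c. */
uint32_t branchless_select(uint32_t a, uint32_t b, uint32_t c)
{
    uint32_t mask = -(uint32_t)(a == 0); /* CMP EAX,1 / SBB EAX,EAX  */
    return (b & ~mask) | (c & mask);     /* AND / XOR / AND / OR     */
}
```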

Whether or not such tricks are worth the extra code depends on how
predictable a conditional jump would be, whether the extra pairing
opportunities of the branch-free code can be utilized, and whether there
are other jumps following immediately after which could suffer the penalty
described above.

On the PentiumPro you can use conditional move instructions to avoid branches.


13. PREFIXES
============
An instruction with one or more prefixes may not be able to execute in the
V-pipe (see paragraph 8.7), and it takes one extra clock cycle for each
prefix to decode, except for 0FH prefixes on the Pentium with MMX and 
conditional near jumps.

Address size prefixes can be avoided by using 32 bit mode.
Segment prefixes can be avoided in 32 bit mode by using a flat memory model.
Operand size prefixes can be avoided in 32 bit mode by using only 8 bit and
32 bit integers.

Where prefixes are unavoidable, the decoding delay may be masked if a 
preceding instruction takes more than one clock cycle to execute. The rule 
for the Pentium without MMX is that any instruction which takes N clock 
cycles to execute (not to decode) can 'overshadow' the decoding delay of N-1
prefixes in the next two (sometimes three) instructions or instruction 
pairs. In other words, each extra clock cycle that an instruction takes to 
execute can be used to decode one prefix in a later instruction. This 
shadowing effect even extends across a predicted branch. Any instruction
which takes more than one clock cycle to execute, and any instruction which 
is delayed because of an AGI stall, cache miss, misalignment, or any other
reason except decoding delay and branch misprediction, has a shadowing 
effect. On the Pentium with MMX, unpaired instructions also have a shadowing 
effect.

Examples:

CLD / REP MOVSD
The CLD instruction takes two clock cycles and can therefore overshadow the 
decoding delay of the REP prefix. The code would take one clock cycle more 
if the CLD instruction was placed far from the REP MOVSD.

CMP DWORD PTR [EBX],0 / MOV EAX,0 / SETNZ AL
The CMP instruction takes two clock cycles here because it is a read/modify 
instruction. The 0FH prefix of the SETNZ instruction is decoded during the 
second clock cycle of the CMP instruction, so that the decoding delay is
hidden.


14. REDUCING CODE SIZE
======================
As explained in paragraph 6, the code cache is 8 kb. If you have problems
keeping the critical parts of your code within the code cache, then you may
consider reducing the size of your code.

32 bit code is usually bigger than 16 bit code because addresses and data
constants take 4 bytes in 32 bit code and only 2 bytes in 16 bit code.
However, 16 bit code has other penalties such as prefixes and problems with
accessing adjacent words simultaneously (see chapter 10). Some other
methods for reducing the size of your code are discussed below.

Jump addresses, data addresses, and data constants all take less space if
they can be expressed as a sign-extended byte, i.e. if they are within the
interval from -128 to +127.

For jump addresses this means that short jumps take two bytes of code,
whereas jumps beyond 127 bytes take 5 bytes if unconditional and 6 bytes if
conditional.

Likewise, data addresses take less space if they can be expressed as a
pointer and a displacement between -128 and +127.
Example:
MOV EBX,DS:[100000] / ADD EBX,DS:[100004]          ; 12 bytes
Reduce to:
MOV EAX,100000 / MOV EBX,[EAX] / ADD EBX,[EAX+4]   ; 10 bytes

The advantage of using a pointer obviously increases if you use it many
times. Storing data on the stack and using EBP or ESP as pointer will thus
make your code smaller than if you use static memory locations and absolute
addresses, provided of course that your data are within 127 bytes of the
pointer. Using PUSH and POP to write and read temporary data is even more
advantageous.

Data constants may also take less space if they are between -128 and +127.
Most instructions with immediate operands have a short form where the
operand is a sign-extended single byte. Examples:
PUSH 200      ; 5 bytes
PUSH 100      ; 2 bytes

ADD EBX,128   ; 6 bytes
SUB EBX,-128  ; 3 bytes

The most important instruction with an immediate operand which doesn't have
such a short form is MOV. Examples:
MOV EAX, 1              ; 5 bytes
XOR EAX,EAX / INC EAX   ; 3 bytes
PUSH 1 / POP EAX        ; 3 bytes

If the same constant is used more than once then you may of course load it
into a register. Example:
MOV DWORD PTR [EBX],0 / MOV DWORD PTR [EBX+4],0   ; 13 bytes
XOR EAX,EAX / MOV [EBX],EAX / MOV [EBX+4],EAX     ;  7 bytes

You may also consider that different instructions have different lengths.
The following instructions take only one byte and are therefore very
attractive: PUSH reg, POP reg, INC reg32, DEC reg32.
INC and DEC with 8 bit registers take 2 bytes, so  INC EAX  is shorter than
INC AL.

XCHG EAX,reg is also a single-byte instruction and thus takes less space 
than MOV EAX,reg, but it is slower and not pairable.

Some instructions take one byte less when they use the accumulator than when
they use any other register. Examples:

MOV EAX,DS:[100000]  is smaller than  MOV EBX,DS:[100000]
ADD EAX,1000         is smaller than  ADD EBX,1000

Instructions with pointers take one byte less when they have only a base 
pointer (not ESP) and a displacement than when they have a scaled index 
register, or both base pointer and index register, or ESP as base pointer. 
Examples: 
MOV EAX,[array][EBX]  is smaller than  MOV EAX,[array][EBX*4]
MOV EAX,[EBP+12]      is smaller than  MOV EAX,[ESP+12]
Instructions with EBP as base pointer and no displacement and no index take
one byte more than with other registers: 
MOV EAX,[EBX]    is smaller than  MOV EAX,[EBP],  but
MOV EAX,[EBX+4]  is same size as  MOV EAX,[EBP+4].  


15. SCHEDULING FLOATING POINT CODE
==================================
Floating point instructions cannot pair the way integer instructions can, 
except for one special case, defined by the following rules:

- the first instruction (executing in the U-pipe) must be FLD, FADD, FSUB, 
  FMUL, FDIV, FCOM, FCHS, or FABS

- the second instruction (in V-pipe) must be FXCH

- the instruction following the FXCH must be a floating point instruction,
  otherwise the FXCH will take an extra clock cycle.

This special pairing is important, as will be explained shortly.

While floating point instructions in general cannot be paired, many can be 
pipelined, i.e. one instruction can begin before the previous instruction has 
finished. Example:

FADD ST(1),ST(0)   ; clock cycle 1-3
FADD ST(2),ST(0)   ; clock cycle 2-4
FADD ST(3),ST(0)   ; clock cycle 3-5
FADD ST(4),ST(0)   ; clock cycle 4-6

Obviously, two instructions cannot overlap if the second instruction needs
the result of the first. Since almost all floating point instructions
involve the top of stack register, ST(0), there are seemingly not very many
possibilities for making an instruction independent of the result of
previous instructions. The solution to this problem is register renaming.
The FXCH instruction does not in reality swap the contents of two registers,
it only swaps their names. Instructions which push or pop the register
stack also work by renaming. Register renaming has been highly optimized on
the Pentium so that a register may be renamed while in use. Register
renaming never causes stalls - it is even possible to rename a register more
than once in the same clock cycle, as for example when you pair FLD or
FCOMPP with FXCH.

By the proper use of FXCH instructions you may obtain a lot of overlapping in
your floating point code. Example:

FLD     [a1]    ; clock cycle 1
FADD    [a2]    ; clock cycle 2-4
FLD     [b1]    ; clock cycle 3
FADD    [b2]    ; clock cycle 4-6
FLD     [c1]    ; clock cycle 5
FADD    [c2]    ; clock cycle 6-8
FXCH    ST(2)   ; clock cycle 6
FADD    [a3]    ; clock cycle 7-9
FXCH    ST(1)   ; clock cycle 7
FADD    [b3]    ; clock cycle 8-10
FXCH    ST(2)   ; clock cycle 8
FADD    [c3]    ; clock cycle 9-11
FXCH    ST(1)   ; clock cycle 9
FADD    [a4]    ; clock cycle 10-12
FXCH    ST(2)   ; clock cycle 10
FADD    [b4]    ; clock cycle 11-13
FXCH    ST(1)   ; clock cycle 11
FADD    [c4]    ; clock cycle 12-14
FXCH    ST(2)   ; clock cycle 12

In the above example we are interleaving three independent threads. Each
FADD takes 3 clock cycles, and we can start a new FADD in each clock cycle.
When we have started an FADD in the 'a' thread we have time to start two new
FADD instructions in the 'b' and 'c' threads before returning to the 'a'
thread, so every third FADD belongs to the same thread. We are using FXCH
instructions every time to get the register that belongs to the desired
thread into ST(0). As you can see in the example above, this generates a
regular pattern, but note well that the FXCH instructions repeat with a
period of two while the threads have a period of three. This can be quite
confusing, so you have to 'play computer' in order to know which registers
are where.

All versions of the instructions FADD, FSUB, FMUL, and FILD take 3 clock
cycles and are able to overlap, so that these instructions may be scheduled
using the method described above. Using a memory operand does not take more
time than a register operand if the memory operand is in the level 1 cache
and properly aligned.

By now you must be used to the rules having exceptions, and the overlapping
rule is no exception: You cannot start an FMUL instruction one clock cycle 
after another FMUL instruction, because the FMUL circuitry is not perfectly 
pipelined. It is recommended that you put another instruction in between two 
FMUL's. Example:

FLD     [a1]    ; clock cycle 1
FLD     [b1]    ; clock cycle 2
FLD     [c1]    ; clock cycle 3
FXCH    ST(2)   ; clock cycle 3
FMUL    [a2]    ; clock cycle 4-6
FXCH            ; clock cycle 4
FMUL    [b2]    ; clock cycle 5-7    (stall)
FXCH    ST(2)   ; clock cycle 5
FMUL    [c2]    ; clock cycle 7-9    (stall)
FXCH            ; clock cycle 7
FSTP    [a3]    ; clock cycle 8-9
FXCH            ; clock cycle 10     (unpaired)
FSTP    [b3]    ; clock cycle 11-12
FSTP    [c3]    ; clock cycle 13-14

Here you have a stall before  FMUL [b2]  and before  FMUL [c2]  because 
another FMUL started in the preceding clock cycle. You can improve this code 
by putting FLD instructions in between the FMUL's:

FLD     [a1]    ; clock cycle 1
FMUL    [a2]    ; clock cycle 2-4
FLD     [b1]    ; clock cycle 3
FMUL    [b2]    ; clock cycle 4-6
FLD     [c1]    ; clock cycle 5
FMUL    [c2]    ; clock cycle 6-8
FXCH    ST(2)   ; clock cycle 6
FSTP    [a3]    ; clock cycle 7-8
FSTP    [b3]    ; clock cycle 9-10
FSTP    [c3]    ; clock cycle 11-12

In other cases you may put FADD, FSUB, or anything else in between FMUL's to
avoid the stalls.

Overlapping floating point instructions requires of course that you have 
some independent threads that you can interleave. If you have only one big 
formula to execute, then you may compute parts of the formula in parallel to 
achieve overlapping. If, for example, you want to add six numbers, then you 
may split the operations into two threads with three numbers in each, and
add the two threads in the end: 

FLD     [a]     ; clock cycle 1
FADD    [b]     ; clock cycle 2-4
FLD     [c]     ; clock cycle 3
FADD    [d]     ; clock cycle 4-6
FXCH            ; clock cycle 4
FADD    [e]     ; clock cycle 5-7
FXCH            ; clock cycle 5
FADD    [f]     ; clock cycle 7-9    (stall)
FADD            ; clock cycle 10-12  (stall)

Here we have a one clock stall before  FADD [f]  because it is waiting for
the result of  FADD [d]  and a two clock stall before the last FADD because 
it is waiting for the result of  FADD [f]. The latter stall can be hidden by 
filling in some integer instructions, but the first stall cannot because an 
integer instruction at this place would make the FXCH unpairable.  

The first stall can be avoided by having three threads rather than two, but 
that would cost an extra FLD so we do not save anything by having three 
threads rather than two unless there are at least eight numbers to add.  
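The same idea carries over directly to high-level code: split one long dependency chain into independent partial sums and combine them at the end. A C sketch (the function name is mine, not from the manual):

```c
/* Sum six numbers using two independent dependency chains ('threads'),
   so that consecutive additions can overlap in a pipelined FPU. */
double sum6(const double v[6])
{
    double t0 = v[0] + v[1];   /* thread 'a' */
    double t1 = v[2] + v[3];   /* thread 'b' */
    t0 += v[4];
    t1 += v[5];
    return t0 + t1;            /* join the two threads at the end */
}
```

Note that reassociating floating point additions like this can change rounding slightly, so a compiler will only do it for you under relaxed floating point settings.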

Not all floating point instructions can overlap. And some floating point 
instructions can overlap more with subsequent integer instructions than 
with subsequent floating point instructions. The FDIV instruction, for example,
takes 39 clock cycles. All but the first clock cycle can overlap with
integer instructions, but only the last two clock cycles can overlap with 
floating point instructions. Example: 

FDIV            ; clock cycle 1-39
FXCH            ; clock cycle 1-2
CMC             ; clock cycle 3-4
RCR EAX,1       ; clock cycle 5
INC EBX         ; clock cycle 5
FADD [x]        ; clock cycle 38-40
FXCH            ; clock cycle 38
FMUL [y]        ; clock cycle 40-42

The first FXCH pairs with the FDIV, but takes an extra clock cycle because
it is not followed by a floating point instruction. The CMC starts before 
the FDIV is finished, but has to wait for the FXCH to finish. The RCR and
INC instructions are pairing. The FADD has to wait till clock 38 because new
floating point instructions can only execute during the last two clock
cycles of the FDIV. The second FXCH pairs with the FADD. The FMUL has to
wait for the FDIV to finish because it uses the result of the division.

If you have nothing else to put in after a floating point instruction with a
large integer overlap, such as FDIV or FSQRT, then you may put in a dummy
read from an address which you expect to need later in the program to make
sure it is in the level one cache. Example:
        FDIV    QWORD PTR [EBX]
        CMP     [ESI],EAX
        FMUL    QWORD PTR [ESI]
Here we use the integer overlap to pre-load the value at [ESI] into the
cache while the FDIV is being computed (we don't care what the result of the
CMP is).

Paragraph 21 gives a complete listing of floating point instructions, and
what they can pair or overlap with.

One floating point instruction requires special mentioning, namely FST or
FSTP with a memory operand. This instruction takes two clock cycles in the
execution stage, but it seems to start converting the value in ST(0) already
at the address decode stage in the pipeline, so that there is a one clock 
stall if the value to store is not ready one clock cycle in advance. This is 
analogous to an AGI stall. Example: 

FLD     [a1]    ; clock cycle 1
FADD    [a2]    ; clock cycle 2-4
FLD     [b1]    ; clock cycle 3
FADD    [b2]    ; clock cycle 4-6
FXCH            ; clock cycle 4
FSTP    [a3]    ; clock cycle 6-7
FSTP    [b3]    ; clock cycle 8-9

The  FSTP [a3]  stalls for one clock cycle because the result of  FADD [a2]  
is not ready in the preceding clock cycle. In many cases you cannot hide 
this type of stall without scheduling your floating point code into four 
threads or putting some integer instructions in between. No other 
instructions have this weirdness. The two clock cycles in the execution 
stage of the FST(P) instruction cannot pair or overlap with any subsequent 
instructions.  

Instructions with integer operands such as FIADD, FISUB, FIMUL, FIDIV, FICOM 
may be split up into simpler operations in order to improve overlapping. 
Example:

FILD    [a]     ; clock cycle 1-3
FIMUL   [b]     ; clock cycle 4-9

Split up into:

FILD    [a]     ; clock cycle 1-3
FILD    [b]     ; clock cycle 2-4
FMUL            ; clock cycle 5-7

In this example, you save two clocks by overlapping the two FILD instructions.


16. LOOP OPTIMIZATION
=====================
When analyzing a program you often find that 99% of the time consumption
lies in the innermost loop. The way to improve speed is to carefully
optimize the most time-consuming loop in assembly language. The rest of the 
program may be left in high-level language.  

A loop generally contains a counter controlling how many times to iterate,
and often an array access reading or writing one array element in each 
iteration. I have chosen as an example a procedure which reads integers 
from an array, changes the sign of each integer, and stores the results in 
another array.  

The C code for this procedure would be:

void ChangeSign (int * A, int * B, int N) {
  int i;
  for (i=0; i<N; i++) B[i] = -A[i];}

Translating to assembler, we might write the procedure like this: 

Example 1:

_ChangeSign PROC NEAR
        PUSH    ESI
        PUSH    EDI
A       EQU     DWORD PTR [ESP+12]
B       EQU     DWORD PTR [ESP+16]
N       EQU     DWORD PTR [ESP+20]

        MOV     ECX, [N]
        JECXZ   L2
        MOV     ESI, [A]
        MOV     EDI, [B]
        CLD
L1:     LODSD
        NEG     EAX
        STOSD
        LOOP    L1
L2:     POP     EDI
        POP     ESI
        RET                     ; (no extra pop if C calling convention)
_ChangeSign     ENDP

This looks like a nice solution, but it is not optimal because it uses slow 
non-pairable instructions. It takes 11 clock cycles per iteration if all
data are in the level one cache.  

Using pairable instructions only
--------------------------------

Example 2:

        MOV     ECX, [N]
        MOV     ESI, [A]
        TEST    ECX, ECX
        JZ      SHORT L2
        MOV     EDI, [B]
L1:     MOV     EAX, [ESI]       ; u
        XOR     EBX, EBX         ; v (pairs)
        ADD     ESI, 4           ; u
        SUB     EBX, EAX         ; v (pairs)
        MOV     [EDI], EBX       ; u
        ADD     EDI, 4           ; v (pairs)
        DEC     ECX              ; u
        JNZ     L1               ; v (pairs)
L2:

Here we have used pairable instructions only, and scheduled the
instructions so that everything pairs. It now takes only 4 clock cycles per 
iteration. We could have obtained the same speed without splitting the NEG 
instruction, but the other unpairable instructions should be split up.  

Using the same register for counter and index
---------------------------------------------

Example 3:

        MOV     ESI, [A]
        MOV     EDI, [B]
        MOV     ECX, [N]
        XOR     EDX, EDX
        TEST    ECX, ECX
        JZ      SHORT L2
L1:     MOV     EAX, [ESI+4*EDX]          ; u
        NEG     EAX                       ; u
        MOV     [EDI+4*EDX], EAX          ; u
        INC     EDX                       ; v (pairs)
        CMP     EDX, ECX                  ; u
        JB      L1                        ; v (pairs)
L2:

Using the same register for counter and index gives us fewer instructions in 
the body of the loop, but it still takes 4 clocks because we have two 
unpaired instructions.

Letting the counter end at zero
-------------------------------
We want to get rid of the CMP instruction in example 3 by letting the 
counter end at zero and use the zero flag for detecting when we are finished 
as we did in example 2. One way to do this would be to execute the loop 
backwards taking the last array elements first. However, data caches are 
optimized for accessing data forwards, not backwards, so if cache misses are 
likely, then you should rather start the counter at -N and count through 
negative values up to zero. The base registers should then point to the end 
of the arrays rather than the beginning: 

Example 4:

        MOV     ESI, [A]
        MOV     EAX, [N]
        MOV     EDI, [B]
        XOR     ECX, ECX
        LEA     ESI, [ESI+4*EAX]          ; point to end of array A
        SUB     ECX, EAX                  ; -N
        LEA     EDI, [EDI+4*EAX]          ; point to end of array B
        JZ      SHORT L2
L1:     MOV     EAX, [ESI+4*ECX]          ; u
        NEG     EAX                       ; u
        MOV     [EDI+4*ECX], EAX          ; u
        INC     ECX                       ; v (pairs)
        JNZ     L1                        ; u
L2:

We are now down to five instructions in the loop body, but it still takes 4 
clocks because of poor pairing. (If the addresses and sizes of the arrays 
are constants we may save two registers by substituting A+SIZE A for ESI and 
B+SIZE B for EDI). Now let's see how we can improve pairing.  

Pairing calculations with loop overhead
---------------------------------------
We may want to improve pairing by intermingling calculations with the loop 
control instructions. If we want to put something in between INC ECX and 
JNZ L1, it has to be something that doesn't affect the zero flag. The  MOV 
[EDI+4*ECX],EBX  instruction after  INC ECX  would generate an AGI delay, 
so we have to be more ingenious: 

Example 5:

        MOV     EAX, [N]
        XOR     ECX, ECX
        SHL     EAX, 2                    ; 4 * N
        JZ      SHORT L3
        MOV     ESI, [A]
        MOV     EDI, [B]
        SUB     ECX, EAX                  ; - 4 * N
        ADD     ESI, EAX                  ; point to end of array A
        ADD     EDI, EAX                  ; point to end of array B
        JMP     SHORT L2
L1:     MOV     [EDI+ECX-4], EAX          ; u
L2:     MOV     EAX, [ESI+ECX]            ; v (pairs)
        XOR     EAX, -1                   ; u
        ADD     ECX, 4                    ; v (pairs)
        INC     EAX                       ; u
        JNC     L1                        ; v (pairs)
        MOV     [EDI+ECX-4], EAX
L3:

I have used a different way to calculate the negative of EAX here: inverting 
all bits and adding one. The reason why I am using this method is that I can 
use a dirty trick with the INC instruction: INC doesn't change the carry 
flag, whereas ADD does. I am using ADD rather than INC to increment my loop 
counter and testing the carry flag rather than the zero flag. It is then 
possible to put the INC EAX in between without affecting the carry flag. You 
may think that we could have used  LEA EAX,[EAX+1]  here instead of  
INC EAX,  since it doesn't change any flags, but the LEA instruction 
would have an AGI stall so that's not the best solution.  

I have obtained perfect pairing here and the loop now takes only 3 clock 
cycles. Whether you want to increment the loop counter by 1 (as in example 
4) or by 4 (as in example 5) is a matter of taste, it makes no difference 
in loop timing.

Overlapping the end of one operation with the beginning of the next
-------------------------------------------------------------------
The method used in example 5 is not very generally applicable so we may look 
for other methods of improving pairing opportunities. One way is to 
reorganize the loop so that the end of one operation overlaps with the
beginning of the next. I will call this convoluting the loop. A convoluted
loop has an unfinished operation at the end of each loop iteration which 
will be finished in the next run. Actually, example 5 did pair the last MOV 
of one iteration with the first MOV of the next, but we want to explore this 
method further: 

Example 6:

        MOV     ESI, [A]
        MOV     EAX, [N]
        MOV     EDI, [B]
        XOR     ECX, ECX
        LEA     ESI, [ESI+4*EAX]          ; point to end of array A
        SUB     ECX, EAX                  ; -N
        LEA     EDI, [EDI+4*EAX]          ; point to end of array B
        JZ      SHORT L3
        XOR     EBX, EBX
        MOV     EAX, [ESI+4*ECX]
        INC     ECX
        JZ      SHORT L2
L1:     SUB     EBX, EAX                  ; u
        MOV     EAX, [ESI+4*ECX]          ; v (pairs)
        MOV     [EDI+4*ECX-4], EBX        ; u
        INC     ECX                       ; v (pairs)
        MOV     EBX, 0                    ; u
        JNZ     L1                        ; v (pairs)
L2:     SUB     EBX, EAX
        MOV     [EDI+4*ECX-4], EBX
L3:

Here we begin reading the second value before we have stored the first, and 
this of course improves pairing opportunities. The MOV EBX,0 instruction 
has been put in between INC ECX and JNZ L1 not to improve pairing but to 
avoid an AGI stall.

Rolling out a loop
------------------
The most generally applicable way to improve pairing opportunities is to do 
two operations for each run and do half as many runs. This is called rolling 
out a loop: 

Example 7:

        MOV     ESI, [A]
        MOV     EAX, [N]
        MOV     EDI, [B]
        XOR     ECX, ECX
        LEA     ESI, [ESI+4*EAX]          ; point to end of array A
        SUB     ECX, EAX                  ; -N
        LEA     EDI, [EDI+4*EAX]          ; point to end of array B
        JZ      SHORT L2
        TEST    AL,1                      ; test if N is odd
        JZ      SHORT L1
        MOV     EAX, [ESI+4*ECX]          ; N is odd. do the odd one
        NEG     EAX
        MOV     [EDI+4*ECX], EAX
        INC     ECX                       ; make counter even
        JZ      SHORT L2                  ; N = 1
L1:     MOV     EAX, [ESI+4*ECX]          ; u
        MOV     EBX, [ESI+4*ECX+4]        ; v (pairs)
        NEG     EAX                       ; u
        NEG     EBX                       ; u
        MOV     [EDI+4*ECX], EAX          ; u
        MOV     [EDI+4*ECX+4], EBX        ; v (pairs)
        ADD     ECX, 2                    ; u
        JNZ     L1                        ; v (pairs)
L2:

Now we are doing two operations in parallel which gives the best pairing 
opportunities. We have to test if N is odd and if so do one operation 
outside the loop because the loop can only do an even number of operations.  

The loop has an AGI stall at the first MOV instruction because ECX has been 
incremented in the preceding clock cycle. The loop therefore takes 6 clock 
cycles for two operations.  

Reorganizing a loop to remove AGI stall
---------------------------------------
Example 8:

        MOV     ESI, [A]
        MOV     EAX, [N]
        MOV     EDI, [B]
        XOR     ECX, ECX
        LEA     ESI, [ESI+4*EAX]          ; point to end of array A
        SUB     ECX, EAX                  ; -N
        LEA     EDI, [EDI+4*EAX]          ; point to end of array B
        JZ      SHORT L3
        TEST    AL,1                      ; test if N is odd
        JZ      L2
        MOV     EAX, [ESI+4*ECX]          ; N is odd. do the odd one
        NEG     EAX                       ; no pairing opportunity
        MOV     [EDI+4*ECX-4], EAX
        INC     ECX                       ; make counter even
        JNZ     L2
        NOP                               ; add NOP's if JNZ L2 not predictable
        NOP
        JMP     L3                        ; N = 1
L1:     NEG     EAX                       ; u
        NEG     EBX                       ; u
        MOV     [EDI+4*ECX-8], EAX        ; u
        MOV     [EDI+4*ECX-4], EBX        ; v (pairs)
L2:     MOV     EAX, [ESI+4*ECX]          ; u
        MOV     EBX, [ESI+4*ECX+4]        ; v (pairs)
        ADD     ECX, 2                    ; u
        JNZ     L1                        ; v (pairs)
        NEG     EAX
        NEG     EBX
        MOV     [EDI+4*ECX-8], EAX
        MOV     [EDI+4*ECX-4], EBX
L3:

The trick is to find a pair of instructions that do not use the loop counter
as index and reorganize the loop so that the counter is incremented in the 
preceding clock cycle. We are now down to 5 clock cycles for two operations 
which is close to the best possible.  

If data caching is critical, then you may improve the speed further by 
combining the A and B arrays into one structured array so that each B[i] 
comes immediately after the corresponding A[i]. If the structured array is 
aligned by at least 8 then B[i] will always be in the same cache line as 
A[i], so you will never have a cache miss when writing B[i]. This may of 
course have a tradeoff in other parts of the program so you have to weigh 
the costs against the benefits.  


Rolling out by more than 2
--------------------------
You may think of doing more than two operations per iteration in order to 
reduce the loop overhead per operation. But since the loop overhead in most 
cases can be reduced to only one clock cycle per iteration, then rolling out 
the loop by 4 rather than by 2 would only save 1/4 clock cycle per 
operation, which is hardly worth the effort. Only if the loop overhead cannot 
be reduced to one clock cycle and if N is very big, should you think of 
unrolling by 4.

The drawbacks of excessive loop unrolling are: 
1.  You need to calculate N MODULO R, where R is the unrolling factor, and 
    do N MODULO R operations before or after the main loop in order to make 
    the remaining number of operations divisible by R. This takes a lot of 
    extra code and poorly predictable branches. And the loop body of course 
    also becomes bigger.  
2.  A piece of code usually takes much more time the first time it executes, 
    and the penalty of first time execution is bigger the more code you 
    have, especially if N is small.  
3.  Excessive code size makes the utilization of the code cache less 
    effective.

Handling multiple 8 or 16 bit operands simultaneously in 32 bit registers
-------------------------------------------------------------------------
If you need to manipulate arrays of 8 or 16 bit operands, then there is a
problem with unrolled loops because you may not be able to pair two memory 
access operations. For example  MOV AL,[ESI] / MOV BL,[ESI+1]  will not pair 
if the two operands are within the same dword of memory. But there may be a 
much smarter method, namely to handle four bytes at a time in the same 32 
bit register.  

The following example adds 2 to all elements of an array of bytes:

Example 9:

        MOV     ESI, [A]         ; address of byte array
        MOV     ECX, [N]         ; number of elements in byte array
        TEST    ECX, ECX         ; test if N is 0
        JZ      SHORT L2
        MOV     EAX, [ESI]       ; read first four bytes
L1:     MOV     EBX, EAX         ; copy into EBX
        AND     EAX, 7F7F7F7FH   ; get lower 7 bits of each byte in EAX
        XOR     EBX, EAX         ; get the highest bit of each byte in EBX
        ADD     EAX, 02020202H   ; add desired value to all four bytes
        XOR     EBX, EAX         ; combine bits again
        MOV     EAX, [ESI+4]     ; read next four bytes
        MOV     [ESI], EBX       ; store result
        ADD     ESI, 4           ; increment pointer
        SUB     ECX, 4           ; decrement loop counter
        JA      L1               ; loop
L2:

This loop takes 5 clock cycles for every 4 bytes. The array should of course 
be aligned by 4. If the number of elements in the array is not divisible by 
four, then you may pad it at the end with a few extra bytes to make the 
length divisible by four. This loop will always read past the end of the 
array, so you should make sure the array is not placed at the end of a 
segment to avoid a general protection error.  

Note that I have masked out the highest bit of each byte to avoid a possible 
carry from each byte into the next when adding. I am using XOR rather than 
ADD when putting in the high bit again to avoid carry.  

The  ADD ESI,4  instruction could have been avoided by using the loop 
counter as index as in example 4. However, this would give an odd number of 
instructions in the loop body, so there would be one unpaired instruction 
and the loop would still take 5 clocks. Making the branch instruction 
unpaired would save one clock after the last operation when the branch is 
mispredicted, but we would have to spend an extra clock cycle in the prolog 
code to set up a pointer to the end of the array and calculate -N, so the two 
methods will be exactly equally fast. The method presented here is the 
simplest and shortest.  

The next example finds the length of a zero-terminated string by searching 
for the first byte of zero. It is faster than using REP SCASB: 

Example 10:

STRLEN  PROC    NEAR
        MOV     EAX,[ESP+4]               ; get pointer
        MOV     EDX,7
        ADD     EDX,EAX                   ; pointer+7 used in the end
        PUSH    EBX
        MOV     EBX,[EAX]                 ; read first 4 bytes
        ADD     EAX,4                     ; increment pointer
L1:     LEA     ECX,[EBX-01010101H]       ; subtract 1 from each byte
        XOR     EBX,-1                    ; invert all bytes
        AND     ECX,EBX                   ; and these two
        MOV     EBX,[EAX]                 ; read next 4 bytes
        ADD     EAX,4                     ; increment pointer
        AND     ECX,80808080H             ; test all sign bits
        JZ      L1                        ; no zero bytes, continue loop
        TEST    ECX,00008080H             ; test first two bytes
        JNZ     SHORT L2
        SHR     ECX,16                    ; not in the first 2 bytes
        ADD     EAX,2
L2:     SHL     CL,1                      ; use carry flag to avoid a branch
        POP     EBX
        SBB     EAX,EDX                   ; compute length
        RET                               ; (or RET 4 if pascal)
STRLEN  ENDP

Again we have used the method of overlapping the end of one operation with 
the beginning of the next to improve pairing. I have not unrolled the loop 
because it is likely to repeat relatively few times. The string should of 
course be aligned by 4. The code will always read past the end of the 
string, so the string should not be placed at the end of a segment.  

The loop body has an odd number of instructions so there is one unpaired.  
Making the branch instruction unpaired rather than one of the other 
instructions has the advantage that it saves 1 clock cycle when the branch 
is mispredicted.

The  TEST ECX,00008080H  instruction is non-pairable. You could use the 
pairable instruction  OR CH,CL  here instead, but then you would have to 
put in a NOP or something for the following reason: The loop branch  (JZ L1)  
is usually mispredicted the last time, when the loop is finished. Executing 
a branch  (JNZ L2)  in the first clock cycle after a mispredicted branch will 
give you a penalty of 5-10 clock cycles. Another problem with  OR CH,CL  is
that it would cause a partial register stall on a 486 or PentiumPro
processor. So I have chosen to keep the unpairable TEST instruction.

Handling 4 bytes simultaneously can be quite difficult. The code uses an 
algorithm which generates a nonzero value for a byte if, and only if, the 
byte is zero. This makes it possible to test all four bytes in one 
operation. This algorithm involves the subtraction of 1 from all bytes (in 
the LEA instruction). I have not masked out the highest bit of each byte
before subtracting, as I did in the previous example, so the subtraction may 
generate a borrow to the next byte, but only if it is zero, and this is 
exactly the situation where we don't care what the next byte is, because we 
are searching forwards for the first zero. If we were searching backwards 
then we would have to re-read the dword after detecting a zero, and then 
test all four bytes to find the last zero, or use BSWAP to reverse the order 
of the bytes.  

If you want to search for a byte value other than zero, then you may XOR all 
four bytes with the value you are searching for, and then use the method 
above to search for zero.  

Handling multiple operands in the same register is easier on the MMX 
processors because they have special instructions and special 64 bit 
registers for this purpose. Using the MMX instructions has a high penalty if 
you are using floating point instructions shortly afterwards, so there may 
still be situations where you want to use 32 bit registers as in the 
examples above.  


Loops with floating point operations
------------------------------------
The methods of optimizing floating point loops are basically the same as for 
integer loops, although the floating point instructions are overlapping 
rather than pairing.  

Consider the C language code:

  int i, n;  double * X;  double * Y;  double DA;
  for (i=0; i<n; i++)  Y[i] = Y[i] - DA * X[i];

This piece of code (called DAXPY) has been studied extensively because it is 
the key to solving linear equations.  

Example 11:

DSIZE   = 8                                      ; data size
        MOV     EAX, [N]                         ; number of elements
        MOV     ESI, [X]                         ; pointer to X
        MOV     EDI, [Y]                         ; pointer to Y
        XOR     ECX, ECX
        LEA     ESI, [ESI+DSIZE*EAX]             ; point to end of X
        SUB     ECX, EAX                         ; -N
        LEA     EDI, [EDI+DSIZE*EAX]             ; point to end of Y
        JZ      SHORT L3                         ; test for N = 0
        FLD     DSIZE PTR [DA]
        FMUL    DSIZE PTR [ESI+DSIZE*ECX]        ; DA * X[0]
        JMP     SHORT L2                         ; jump into loop
L1:     FLD     DSIZE PTR [DA]
        FMUL    DSIZE PTR [ESI+DSIZE*ECX]        ; DA * X[i]
        FXCH                                     ; get old result
        FSTP    DSIZE PTR [EDI+DSIZE*ECX-DSIZE]  ; store Y[i]
L2:     FSUBR   DSIZE PTR [EDI+DSIZE*ECX]        ; subtract from Y[i]
        INC     ECX                              ; increment index
        JNZ     L1                               ; loop
        FSTP    DSIZE PTR [EDI+DSIZE*ECX-DSIZE]  ; store last result
L3:

Here we are using the same methods as in example 6: Using the loop counter 
as index register and counting through negative values up to zero. The end 
of one operation overlaps with the beginning of the next.  

The interleaving of floating point operations works perfectly here: The 2 
clock stall between FMUL and FSUBR is filled with the FSTP of the previous 
result. The 3 clock stall between FSUBR and FSTP is filled with the loop 
overhead and the first two instructions of the next operation. An AGI stall 
has been avoided by reading the only parameter that doesn't depend on the 
index in the first clock cycle after the index has been incremented.  

This solution takes 6 clock cycles per operation, which is better than the 
unrolled solution published by Intel!  

Unrolling floating point loops
------------------------------
The DAXPY loop unrolled by 3 is quite complicated:

Example 12:
DSIZE = 8                                 ; data size
IF DSIZE EQ 4
SHIFTCOUNT = 2
ELSE
SHIFTCOUNT = 3
ENDIF

        MOV     EAX, [N]                  ; number of elements
        MOV     ECX, 3*DSIZE              ; counter bias
        SHL     EAX, SHIFTCOUNT           ; DSIZE*N
        JZ      L4                        ; N = 0
        MOV     ESI, [X]                  ; pointer to X
        SUB     ECX, EAX                  ; (3-N)*DSIZE
        MOV     EDI, [Y]                  ; pointer to Y
        SUB     ESI, ECX                  ; end of pointer - bias
        SUB     EDI, ECX
        TEST    ECX, ECX
        FLD     DSIZE PTR [ESI+ECX]       ; first X
        JNS     SHORT L2                  ; less than 4 operations
L1:     ; main loop
        FMUL    DSIZE PTR [DA]
        FLD     DSIZE PTR [ESI+ECX+DSIZE]
        FMUL    DSIZE PTR [DA]
        FXCH
        FSUBR   DSIZE PTR [EDI+ECX]
        FXCH
        FLD     DSIZE PTR [ESI+ECX+2*DSIZE]
        FMUL    DSIZE PTR [DA]
        FXCH
        FSUBR   DSIZE PTR [EDI+ECX+DSIZE]
        FXCH    ST(2)
        FSTP    DSIZE PTR [EDI+ECX]
        FSUBR   DSIZE PTR [EDI+ECX+2*DSIZE]
        FXCH
        FSTP    DSIZE PTR [EDI+ECX+DSIZE]
        FLD     DSIZE PTR [ESI+ECX+3*DSIZE]
        FXCH
        FSTP    DSIZE PTR [EDI+ECX+2*DSIZE]
        ADD     ECX, 3*DSIZE
        JS      L1                        ; loop
L2:     FMUL    DSIZE PTR [DA]            ; finish leftover operation
        FSUBR   DSIZE PTR [EDI+ECX]
        SUB     ECX, 2*DSIZE              ; change pointer bias
        JZ      SHORT L3                  ; finished
        FLD     DSIZE PTR [DA]            ; start next operation
        FMUL    DSIZE PTR [ESI+ECX+3*DSIZE]
        FXCH
        FSTP    DSIZE PTR [EDI+ECX+2*DSIZE]
        FSUBR   DSIZE PTR [EDI+ECX+3*DSIZE]
        ADD     ECX, 1*DSIZE
        JZ      SHORT L3                  ; finished
        FLD     DSIZE PTR [DA]
        FMUL    DSIZE PTR [ESI+ECX+3*DSIZE]
        FXCH
        FSTP    DSIZE PTR [EDI+ECX+2*DSIZE]
        FSUBR   DSIZE PTR [EDI+ECX+3*DSIZE]
        ADD     ECX, 1*DSIZE
L3:     FSTP    DSIZE PTR [EDI+ECX+2*DSIZE]
L4:

The reason why I am showing you how to unroll a loop by 3 is not to 
recommend it, but to warn you how difficult it is! Be prepared to spend a 
considerable amount of time debugging and verifying your code when doing 
something like this. There are several problems to take care of: In most 
cases, you cannot remove all stalls from a floating point loop unrolled by 
less than 4 unless you convolute it (i.e. there are unfinished operations at
the end of each run which are being finished in the next run). The last FLD 
in the main loop above is the beginning of the first operation in the next 
run. It would be tempting here to make a solution which reads past the end 
of the array and then discards the extra value in the end, as in examples 9
and 10, but that is not recommended in floating point loops because the
reading of the extra value might generate a denormal operand exception in
case the memory position after the array doesn't contain a valid floating 
point number. To avoid this, we have to do at least one more operation after 
the main loop.  

The number of operations to do outside an unrolled loop would normally be
N MODULO R, where N is the number of operations, and R is the unrolling
factor.  But in the case of a convoluted loop, we have to do one more, i.e.
(N-1) MODULO R + 1, for the reason mentioned above.  

Normally, we would prefer to do the extra operations before the main loop, 
but here we have to do them afterwards for two reasons: One reason is to 
take care of the leftover operand from the convolution. The other reason is
that calculating the number of extra operations requires a division if R is
not a power of 2, and a division is time consuming. Doing the extra
operations after the loop saves the division.

The next problem is to calculate how to bias the loop counter so that it 
will change sign at the right time, and adjust the base pointers so as to 
compensate for this bias. Finally, you have to make sure the leftover 
operand from the convolution is handled correctly for all values of N.

The epilog code doing 1-3 operations could have been implemented as a
separate loop, but that would cost an extra branch misprediction, so the 
solution above is faster.  

Now that I have scared you by demonstrating how difficult it is to unroll by 
3, I will show you that it is much easier to unroll by 4: 

Example 13:

DSIZE   = 8                               ; data size
        MOV     EAX, [N]                  ; number of elements
        MOV     ESI, [X]                  ; pointer to X
        MOV     EDI, [Y]                  ; pointer to Y
        XOR     ECX, ECX
        LEA     ESI, [ESI+DSIZE*EAX]      ; point to end of X
        SUB     ECX, EAX                  ; -N
        LEA     EDI, [EDI+DSIZE*EAX]      ; point to end of Y
        TEST    AL,1                      ; test if N is odd
        JZ      SHORT L1
        FLD     DSIZE PTR [DA]            ; do the odd operation
        FMUL    DSIZE PTR [ESI+DSIZE*ECX]
        FSUBR   DSIZE PTR [EDI+DSIZE*ECX]
        INC     ECX                       ; adjust counter
        FSTP    DSIZE PTR [EDI+DSIZE*ECX-DSIZE]
L1:     TEST    AL,2                      ; test for possibly 2 more operations
        JZ      L2
        FLD     DSIZE PTR [DA]            ; N MOD 4 = 2 or 3. Do two more
        FMUL    DSIZE PTR [ESI+DSIZE*ECX]
        FLD     DSIZE PTR [DA]
        FMUL    DSIZE PTR [ESI+DSIZE*ECX+DSIZE]
        FXCH
        FSUBR   DSIZE PTR [EDI+DSIZE*ECX]
        FXCH
        FSUBR   DSIZE PTR [EDI+DSIZE*ECX+DSIZE]
        FXCH
        FSTP    DSIZE PTR [EDI+DSIZE*ECX]
        FSTP    DSIZE PTR [EDI+DSIZE*ECX+DSIZE]
        ADD     ECX, 2                    ; counter is now divisible by 4
L2:     TEST    ECX, ECX
        JZ      L4                        ; no more operations
L3:     ; main loop:
        FLD     DSIZE PTR [DA]
        FLD     DSIZE PTR [ESI+DSIZE*ECX]
        FMUL    ST,ST(1)
        FLD     DSIZE PTR [ESI+DSIZE*ECX+DSIZE]
        FMUL    ST,ST(2)
        FLD     DSIZE PTR [ESI+DSIZE*ECX+2*DSIZE]
        FMUL    ST,ST(3)
        FXCH    ST(2)
        FSUBR   DSIZE PTR [EDI+DSIZE*ECX]
        FXCH    ST(3)
        FMUL    DSIZE PTR [ESI+DSIZE*ECX+3*DSIZE]
        FXCH
        FSUBR   DSIZE PTR [EDI+DSIZE*ECX+DSIZE]
        FXCH    ST(2)
        FSUBR   DSIZE PTR [EDI+DSIZE*ECX+2*DSIZE]
        FXCH
        FSUBR   DSIZE PTR [EDI+DSIZE*ECX+3*DSIZE]
        FXCH    ST(3)
        FSTP    DSIZE PTR [EDI+DSIZE*ECX]
        FSTP    DSIZE PTR [EDI+DSIZE*ECX+2*DSIZE]
        FSTP    DSIZE PTR [EDI+DSIZE*ECX+DSIZE]
        FSTP    DSIZE PTR [EDI+DSIZE*ECX+3*DSIZE]
        ADD     ECX, 4                             ; increment index by 4
        JNZ     L3                                 ; loop
L4:

It is usually quite easy to find a stall-free solution when unrolling by 4, 
and there is no need for convolution. The number of extra operations to do
outside the main loop is N MODULO 4, which can be calculated easily without 
division, simply by testing the two lowest bits in N. The extra operations
are done before the main loop rather than after, to make the handling of the 
loop counter simpler.  

The tradeoff of loop unrolling is that the extra operations outside the loop 
are slower due to incomplete overlapping and possible branch mispredictions, 
and the first time penalty is higher because of increased code size.  

As a general recommendation, I would say that if N is big or if convoluting 
the loop without unrolling cannot remove enough stalls, then you should 
unroll critical integer loops by 2 and floating point loops by 4.  


17. DISCUSSION OF SPECIAL INSTRUCTIONS
======================================

17.1 LEA
--------
The LEA instruction is useful for many purposes because it can do a shift,
two additions, and a move in just one pairable instruction taking one clock
cycle.  Example:
LEA EAX,[EBX+8*ECX-1000]
is much faster than
MOV EAX,ECX / SHL EAX,3 / ADD EAX,EBX / SUB EAX,1000
The LEA instruction can also be used to do an add or shift without changing
the flags. The source and destination need not have the same word size, so
LEA EAX,[BX]  is a useful replacement for  MOVZX EAX,BX.

You must be aware, however, that the LEA instruction will suffer an AGI
stall if it uses a base or index register which has been changed in the
preceding clock cycle.

Since the LEA instruction is pairable in the V-pipe and shift instructions
are not, you may use LEA as a substitute for a SHL by 1, 2, or 3 if you want
the instruction to execute in the V-pipe.

The 32 bit processors have no documented addressing mode with a scaled index
register and nothing else, so an instruction like  LEA EAX,[EAX*2]  is
actually coded as  LEA EAX,[EAX*2+00000000]  with an immediate displacement
of 4 bytes.  You may reduce the instruction size by instead writing  LEA
EAX,[EAX+EAX] or even better  ADD EAX,EAX.  The latter code cannot have an
AGI delay. If you happen to have a register which is zero (like a loop 
counter after a loop), then you may use it as a base register to reduce the 
code size: 

LEA EAX,[EBX*4]     ; 7 bytes
LEA EAX,[ECX+EBX*4] ; 3 bytes


17.2 TEST
---------
The TEST instruction with an immediate operand is only pairable if the
destination is AL, AX, or EAX.

TEST register,register  and  TEST register,memory  are always pairable.

Examples:
TEST ECX,ECX                ; pairable
TEST [mem],EBX              ; pairable
TEST EDX,256                ; not pairable
TEST DWORD PTR [EBX],8000H  ; not pairable
To make it pairable, use any of the following methods:
MOV EAX,[EBX] / TEST EAX,8000H
MOV EDX,[EBX] / AND  EDX,8000H
MOV AL,[EBX+1] / TEST AL,80H
MOV AL,[EBX+1] / TEST AL,AL  ; (result in sign flag)
It is also possible to test a bit by shifting it into the carry flag:
MOV EAX,[EBX] / SHR EAX,16   ; (result in carry flag)
but this method has a penalty on the PentiumPro when the shift count is more
than one.

(The reason for this non-pairability is probably that the first byte of the
2-byte instruction is the same as for some other non-pairable instructions,
and the Pentium cannot afford to check the second byte too when determining
pairability.)


17.3 XCHG
---------
The  XCHG register,memory  instruction is dangerous. By default this
instruction has an implicit LOCK prefix which prevents it from using the
cache. The instruction is therefore very time consuming, and should always
be avoided.


17.4 rotates through carry
--------------------------
RCR and RCL with a count different from one are slow and should be avoided.


17.5 string instructions
------------------------
String instructions without a repeat prefix are too slow, and should always
be replaced by simpler instructions. The same applies to LOOP and JECXZ.  

String instructions with repeat may be optimal. Always use the dword version 
if possible, and make sure that both source and destination are aligned by 4.

REP MOVSD is the fastest way to move blocks of data when the destination is
in the cache. See section 19 for an alternative.

REP STOSD is optimal when the destination is in the cache.

REP LODS, REP SCAS, and REP CMPS are not optimal, and may be replaced by
loops. See section 16 example 10 for an alternative to REP SCASB.


17.6 bit scan
-------------
BSF and BSR are the poorest optimized instructions on the Pentium, taking
approximately 11 + 2*n clock cycles, where n is the number of zeros skipped
(on later processors they take only 1 or 2 clock cycles).

The following code emulates BSR ECX,EAX:
        TEST    EAX,EAX
        JZ      SHORT BS1
        MOV     DWORD PTR [TEMP],EAX
        MOV     DWORD PTR [TEMP+4],0
        FILD    QWORD PTR [TEMP]
        FSTP    QWORD PTR [TEMP]
        WAIT    ; WAIT only needed for compatibility with earlier processors
        MOV     ECX, DWORD PTR [TEMP+4]
        SHR     ECX,20
        SUB     ECX,3FFH
        TEST    EAX,EAX       ; clear zero flag
BS1:

The following code emulates BSF ECX,EAX:
        TEST    EAX,EAX
        JZ      SHORT BS2
        XOR     ECX,ECX
        MOV     DWORD PTR [TEMP+4],ECX
        SUB     ECX,EAX
        AND     EAX,ECX
        MOV     DWORD PTR [TEMP],EAX
        FILD    QWORD PTR [TEMP]
        FSTP    QWORD PTR [TEMP]
        WAIT    ; WAIT only needed for compatibility with earlier processors
        MOV     ECX, DWORD PTR [TEMP+4]
        SHR     ECX,20
        SUB     ECX,3FFH
        TEST    EAX,EAX       ; clear zero flag
BS2:


17.7 bit test
-------------
BT, BTC, BTR, and BTS instructions should preferably be replaced by
instructions like TEST, AND, OR, XOR, or shifts.


17.8 integer multiplication
---------------------------
An integer multiplication takes approximately 9 clock cycles. It is
therefore advantageous to replace a multiplication by a constant with a
combination of other instructions such as SHL, ADD, SUB, and LEA.  Example: 
IMUL EAX,10
can be replaced with
MOV EBX,EAX / ADD EAX,EAX / SHL EBX,3 / ADD EAX,EBX
or
LEA EAX,[EAX+4*EAX] / ADD EAX,EAX

Floating point multiplication is faster than integer multiplication on a 
Pentium without MMX, but the time used to convert integers to float and 
convert the product back again is usually more than the time saved by using
floating point multiplication, except when the number of conversions is low 
compared with the number of multiplications.


17.9 division
-------------
Division is quite time consuming. The DIV instruction takes 17, 25, or 41
clock cycles for byte, word, and dword divisors respectively. The IDIV
instruction takes 5 clock cycles more. It is therefore preferable to use the
smallest operand size possible that won't generate an overflow, even if it
costs an operand size prefix, and use unsigned division if possible.

Unsigned division by a power of two can be done with SHR.  Division of a
signed number by a power of two can be done with SAR, but the result with
SAR is rounded towards minus infinity, whereas the result with IDIV is
truncated towards zero.

Dividing a positive number N (< 2^16) by a constant D which is not a
power of two can be done by multiplying N by 2^16 / D and then dividing
by 2^16. For example, to divide a number by 5, you may multiply by
2^16 / 5 = 3333H, and then divide by 2^16:

INC EAX  /  IMUL EAX,3333H  /  SHR EAX,16

The  INC EAX  is added to compensate for the double rounding error.
This method works for 0 <= N <= 2^16. You have to test the code thoroughly
to make sure the rounding is correct.

If you want to replace the IMUL with faster pairable instructions, then
you may write:

        LEA     EBX,[EAX+2*EAX+3]
        LEA     ECX,[EAX+2*EAX+3]
        SHL     EBX,4
        MOV     EAX,ECX
        SHL     ECX,8
        ADD     EAX,EBX
        SHL     EBX,8
        ADD     EAX,ECX
        ADD     EAX,EBX
        SHR     EAX,16

You can use the same code to divide by 10. Just shift right by 17 instead
of 16 in the end.


Floating point division takes 39 clock cycles for the highest precision.
You can save time by specifying a lower precision in the floating point
control word (only FDIV and FIDIV are faster at low precision; no other
instructions can be sped up this way).

It is possible to do a floating point division and an integer division in
parallel to save time.  Example: A = A1 / A2;  B = B1 / B2

FILD [B1]
FILD [B2]
MOV EAX,[A1]
MOV EBX,[A2]
CDQ
FDIV
DIV EBX
FISTP [B]
MOV [A],EAX
(make sure you set the floating point control word to the desired
rounding method)

Obviously, you should always try to minimize the number of divisions.
Floating point division with a constant or repeated division with the same
value should of course be done by multiplying by the reciprocal.
But there are many other situations where you can reduce the number of
divisions. For example:
if (A/B > C)...  can be rewritten as  if (A > B*C)...  when B is
positive, and the opposite when B is negative.

A/B + C/D  can be rewritten as  (A*D + C*B) / (B*D)

If you are using integer division, then you should be aware that the
rounding errors may be different when you rewrite the formulas.


17.10 WAIT
----------
You can often increase speed by omitting the WAIT instruction.
The WAIT instruction has three functions:

  a. The 8087 processor requires a WAIT before _every_ floating point
     instruction.

  b. WAIT is used to coordinate memory access between the floating point
     unit and the integer unit. Examples:
     b.1.  FISTP [mem32]
           WAIT             ; wait for f.p. unit to write before..
           MOV EAX,[mem32]  ; reading the result with the integer unit

     b.2.  FILD [mem32]
           WAIT             ; wait for f.p. unit to read value before..
           MOV [mem32],EAX  ; overwriting it with integer unit

     b.3.  FLD DWORD PTR [ESP]
           WAIT             ; prevent an accidental hardware interrupt from..
           ADD ESP,4        ; overwriting value on stack before it is read

  c. WAIT is sometimes used to check for exceptions. It will generate an 
     interrupt if there is an unmasked exception bit in the f.p. status word 
     set by a preceding floating point instruction.

Regarding a:
The function in point a is never needed on any processor other than the old 
8087. Unless you want your code to be compatible with the 8087, you should
tell your assembler not to insert these WAITs by specifying a higher
processor.

Regarding b:
WAIT instructions to coordinate memory access are definitely needed on the 
8087 and 80287. A superscalar processor like the Pentium has special
circuitry to detect memory conflicts so you wouldn't need the WAIT for this
purpose on code that only runs on a Pentium or higher. I have made some
tests on other Intel processors and not been able to provoke any error by
omitting the WAIT on any 32 bit Intel processor, although Intel manuals say 
that the WAIT is needed for this purpose except after FNSTSW and FNSTCW. If 
you want to be certain that your code will work on any 32 bit processor 
(including non-Intel processors) then I would recommend that you include the 
WAIT here in order to be safe.  

Regarding c:
The assembler automatically inserts a WAIT for this purpose before the
following instructions: 
FCLEX, FINIT, FSAVE, FSTCW, FSTENV, FSTSW
You can omit the WAIT by writing FNCLEX, etc. My tests show that the WAIT is
unnecessary in most cases because these instructions without WAIT will 
still generate an interrupt on exceptions except for FNCLEX and FNINIT on 
the 80387. (There is some inconsistency about whether the IRET from the
interrupt points to the FN.. instruction or to the next instruction).

Almost all other floating point instructions will also generate an interrupt 
if a previous floating point instruction has set an unmasked exception bit, 
so the exception is likely to be detected sooner or later anyway.  

You may still need the WAIT if you want to know exactly where an exception
occurred in order to recover from the situation. Consider, for example, the 
code under b.3 above: If you want to be able to recover from an exception 
generated by the FLD here, then you need the WAIT because an interrupt after 
ADD ESP,4 would overwrite the value to load.


17.11 FCOM + FSTSW AX
---------------------
The usual way of doing floating point comparisons is:
FLD [a]
FCOMP [b]
FSTSW AX
SAHF
JB ASmallerThanB

You may improve this code by using FNSTSW AX rather than FSTSW AX and 
testing AH directly rather than using the non-pairable SAHF.
(TASM version 3.0 has a bug with the FNSTSW AX instruction)

FLD [a]
FCOMP [b]
FNSTSW AX
SHR AH,1
JC ASmallerThanB

Testing for zero or equality:

FTST
FNSTSW AX
AND AH,40H
JNZ IsZero     ; (the zero flag is inverted!)

Test if greater:

FLD [a]
FCOMP [b]
FNSTSW AX
AND AH,41H
JZ AGreaterThanB

Do not use TEST AH,41H as it is not pairable. Do not use TEST EAX,4100H as
it would produce a partial register stall on the PentiumPro. Do not test the
flags after multibit shifts, as this has a penalty on the PentiumPro.  

It is often faster to use integer instructions for comparing floating point
values, as described in section 18 below.


17.12 FISTP
-----------
Converting a floating point number to integer is normally done like this:
        FISTP   DWORD PTR [TEMP]
        MOV     EAX, [TEMP]

An alternative method is:

.DATA
ALIGN 8
TEMP    DQ      ?
MAGIC   DD      59C00000H   ; f.p. representation of 2^51 + 2^52

.CODE
        FADD    [MAGIC]
        FSTP    QWORD PTR [TEMP]
        MOV     EAX, DWORD PTR [TEMP]

Adding the 'magic number' of 2^51 + 2^52 has the effect that any integer
between -2^31 and +2^31 will be aligned in the lower 32 bits when storing
as a double precision floating point number. The result is the same as you
get with FISTP for all rounding methods except truncation towards zero.
The result is different from FISTP if the control word specifies truncation
or in case of overflow. You may need a WAIT instruction for compatibility
with other processors. See section 17.10.

This method is not faster than using FISTP, but it gives better scheduling
opportunities because there is a 3 clock void between FADD and FSTP which
may be filled with other instructions. You may multiply or divide the number
by a power of 2 in the same operation by scaling the magic number in the
opposite direction. You may also add a constant by adding it to the magic number,
which then has to be double precision.


17.13 FPTAN
-----------
According to the manuals, FPTAN returns two values X and Y and leaves it to
the programmer to divide Y by X to get the result, but in fact it always
returns 1 in X so you can save the division. My tests show that on all 32
bit Intel processors with floating point unit or coprocessor, FPTAN always 
returns 1 in X regardless of the argument. If you want to be absolutely sure
that your code will run correctly on all processors, then you may test if X
is 1, which is faster than dividing by X. The Y value may be very high,
but never infinity, so you don't have to test if Y contains a valid value.


18. USING INTEGER INSTRUCTIONS TO DO FLOATING POINT OPERATIONS
==============================================================
Integer instructions are generally faster than floating point instructions,
so it is often advantageous to use integer instructions for doing simple 
floating point operations. The most obvious example is moving data. Example: 

FLD QWORD PTR [ESI] / FSTP QWORD PTR [EDI]

Change to:

MOV EAX,[ESI] / MOV EBX,[ESI+4] / MOV [EDI],EAX / MOV [EDI+4],EBX

The former code takes 4 clocks, the latter takes 2.

Testing if a floating point value is zero:

The floating point value of zero is usually represented as 32 or 64 bits of 
zero, but there is a pitfall here: The sign bit may be set! Minus zero is 
regarded as a valid floating point number, and the processor may actually 
generate a zero with the sign bit set if for example multiplying a negative 
number with zero. So if you want to test if a floating point number is zero,
you should not test the sign bit. Example:

FLD DWORD PTR [EBX] / FTST / FNSTSW AX / AND AH,40H / JNZ IsZero

Use integer instructions instead, and shift out the sign bit:

MOV EAX,[EBX] / ADD EAX,EAX / JZ IsZero

The former code takes 9 clocks, the latter takes only 2.
If the floating point number is double precision (QWORD) then you only have
to test bits 32-62. If they are zero, then the lower half will also be zero
if it is a normal floating point number.

Testing if negative:
A floating point number is negative if the sign bit is set and at least one 
other bit is set. Example:
MOV EAX,[NumberToTest] / CMP EAX,80000000H / JA IsNegative

Manipulating the sign bit:
You can change the sign of a floating point number simply by flipping the sign
bit. Example:
XOR BYTE PTR [a] + (TYPE a) - 1, 80H

Likewise you may get the absolute value of a floating point number by simply
ANDing out the sign bit.

Comparing numbers:
Floating point numbers are stored in a format which allows you to use
integer instructions for comparing floating point numbers, except for the
sign bit. If you are certain that two floating point numbers both are normal
and positive, then you may simply compare them as integers. Example:

FLD [a] / FCOMP [b] / FNSTSW AX / AND AH,1 / JNZ ASmallerThanB

Change to:

MOV EAX,[a] / MOV EBX,[b] / CMP EAX,EBX / JB ASmallerThanB

This method only works if the two numbers have the same precision and you
are certain that neither of the numbers has the sign bit set.

If negative numbers are possible, then you have to convert the
negative numbers to two's complement, and do a signed compare:

MOV EAX,[a]
MOV EBX,[b]
MOV ECX,EAX
MOV EDX,EBX
SAR ECX,31              ; copy sign bit
AND EAX,7FFFFFFFH       ; remove sign bit
SAR EDX,31
AND EBX,7FFFFFFFH
XOR EAX,ECX             ; make two's complement if sign bit was set
XOR EBX,EDX
SUB EAX,ECX
SUB EBX,EDX
CMP EAX,EBX
JL  ASmallerThanB       ; signed comparison

This method works for all normal floating point numbers, including -0.


19. USING FLOATING POINT INSTRUCTIONS TO DO INTEGER OPERATIONS
==============================================================
19.1 Moving data
----------------
Floating point instructions can be used to move 8 bytes at a time:
FILD QWORD PTR [ESI] / FISTP QWORD PTR [EDI]
This is only an advantage if the destination is not in the cache. The
optimal way to move a block of data to uncached memory on the Pentium is:

TopOfLoop:
FILD QWORD PTR [ESI]
FILD QWORD PTR [ESI+8]
FXCH
FISTP QWORD PTR [EDI]
FISTP QWORD PTR [EDI+8]
ADD ESI,16
ADD EDI,16
DEC ECX
JNZ TopOfLoop

The source and destination should of course be aligned by 8. The extra time
used by the slow FILD and FISTP instructions is compensated for by the fact
that you only have to do half as many write operations.  Note that this 
method is only advantageous on the Pentium and only if the destination is 
not in the cache. On all other processors the optimal way to move blocks of 
data is REP MOVSD, or if you have a processor with MMX you may use the MMX 
instructions instead to write 8 bytes at a time.

19.2 Integer multiplication
---------------------------
Floating point multiplication is faster than integer multiplication on the 
Pentium without MMX, but the price for converting integer factors to float
and converting the result back to integer is high, so floating point 
multiplication is only advantageous if the number of conversions needed is 
low compared with the number of multiplications. Integer multiplication is
faster than floating point on other processors.

It may be tempting to use denormal floating point operands to save some of
the conversions here, but the handling of denormals is very slow, so this
is not a good idea!

19.3 Integer division
---------------------
Floating point division is not faster than integer division, but you can do 
other integer operations (including integer division, but not integer 
multiplication) while the floating point unit is working on the division. 
See paragraph 17.9 above for an example.

19.4 Converting binary to decimal numbers
-----------------------------------------
Using the FBSTP instruction is a simple and convenient way of converting a 
binary number to decimal, although not necessarily the fastest method.


20. LIST OF INTEGER INSTRUCTIONS
================================
Explanations:
Operands: r=register, m=memory, i=immediate data, sr=segment register
m32= 32 bit memory operand, etc.

Clock cycles:
The numbers are minimum values. Cache misses, misalignment, and exceptions
may increase the clock counts considerably.

Pairability:
u=pairable in U-pipe, v=pairable in V-pipe, uv=pairable in either pipe,
np=not pairable 

Opcode                 Operands            Clock cycles        Pairability
----------------------------------------------------------------------------
NOP                                        1                   uv
MOV                    r/m, r/m/i          1                   uv
MOV                    r/m, sr             1                   np
MOV                    sr,  r/m            >= 2 b)             np
XCHG                   (E)AX, r            2                   np
XCHG                   r  ,   r            3                   np
XCHG                   r  ,   m            >20                 np
XLAT                                       4                   np
PUSH                   r/i                 1                   uv
POP                    r                   1                   uv
PUSH                   m                   2                   np
POP                    m                   3                   np
PUSH                   sr                  1 b)                np
POP                    sr                  >= 3 b)             np
PUSHF                                      4                   np
POPF                                       6                   np
PUSHA POPA                                 5                   np
LAHF SAHF                                  2                   np
MOVSX MOVZX            r, r/m              3 a)                np
LEA                    r  , m              1                   uv
LDS LES LFS LGS LSS    m                   4 c)                np
ADD SUB AND OR XOR     r  , r/i            1                   uv
ADD SUB AND OR XOR     r  , m              2                   uv
ADD SUB AND OR XOR     m  , r/i            3                   uv
CMP                    r  , r/i            1                   uv
CMP                    m  , r/i            2                   uv
TEST                   r  , r              1                   uv
TEST                   m  , r              2                   uv
TEST                   r  , i              1                   f)
TEST                   m  , i              2                   np
ADC SBB                r/m, r/m/i          1/3                 u
INC DEC                r                   1                   uv
INC DEC                m                   3                   uv
NEG NOT                r/m                 1/3                 np
MUL IMUL               r8/r16/m8/m16      11                   np
MUL IMUL               all other versions  9 d)                np
DIV                    r8/r16/r32          17/25/41            np
IDIV                   r8/r16/r32          22/30/46            np
CBW CWDE                                   3                   np
CWD CDQ                                    2                   np
SHR SHL SAR SAL        r  , i              1                   u
SHR SHL SAR SAL        m  , i              3                   u
SHR SHL SAR SAL        r/m, CL             4/5                 np
ROR ROL RCR RCL        r/m, 1              1/3                 u
ROR ROL                r/m, i(<>1)         1/3                 np
ROR ROL                r/m, CL             4/5                 np
RCR RCL                r/m, i(<>1)         8/10                np
RCR RCL                r/m, CL             7/9                 np
SHLD SHRD              r, i/CL             4 a)                np
SHLD SHRD              m, i/CL             5 a)                np
BT                     r, r/i              4 a)                np
BT                     m, i                4 a)                np
BT                     m, r                9 a)                np
BTR BTS BTC            r, r/i              7 a)                np
BTR BTS BTC            m, i                8 a)                np
BTR BTS BTC            m, r               14 a)                np
BSF BSR                r  , r/m            7-73 a)             np
SETcc                  r/m                 1/2 a)              np
JMP CALL               short/near          1                   v
JMP CALL               far                 >= 3                np
conditional jump       short/near          1/4/5 e)            v
CALL JMP               r/m                 2                   np
RETN                                       2                   np
RETN                   i                   3                   np
RETF                                       4                   np
RETF                   i                   5                   np
J(E)CXZ                short               5-8                 np
LOOP                   short               5-9                 np
BOUND                  r  , m              8                   np
CLC STC CMC CLD STD                        2                   np
CLI STI                                    6-7                 np
LODS                                       2                   np
REP LODS                                   7+3*n g)            np
STOS                                       3                   np
REP STOS                                   10+n  g)            np
MOVS                                       4                   np
REP MOVSB                                  12+1.8*n g)         np
REP MOVSW                                  12+1.5*n g)         np
REP MOVSD                                  12+n     g)         np
SCAS                                       4                   np
REP(N)E SCAS                               9+4*n    g)         np
CMPS                                       5                   np
REP(N)E CMPS                               8+5*n    g)         np
BSWAP                                      1 a)                np
----------------------------------------------------------------------------
Notes:
a) this instruction has a 0FH prefix which takes one clock cycle extra to
   decode on a Pentium without MMX unless preceded by a multicycle
   instruction (see section 13 above).
b) versions with FS and GS have a 0FH prefix. see note a.
c) versions with SS, FS, and GS have a 0FH prefix. see note a.
d) versions with two operands and no immediate have a 0FH prefix, see note a.
e) see section 12 above
f) only pairable if register is accumulator. see paragraph 17.2 above
g) add one clock cycle for decoding the repeat prefix unless preceded by a
   multicycle instruction (such as CLD. see section 13 above).


21. LIST OF FLOATING POINT INSTRUCTIONS
=======================================
Explanations:
Operands: r=register, m=memory, m32=32 bit memory operand, etc.

Clock cycles:
The numbers are minimum values. Cache misses, misalignment, denormal
operands, and exceptions may increase the clock counts considerably.

Pairability:
+=pairable with FXCH, np=not pairable with FXCH.

i-ov:
Overlap with integer instructions. i-ov = 4 means that the last four clock 
cycles can overlap with subsequent integer instructions.  

fp-ov:
Overlap with floating point instructions. fp-ov = 2 means that the last two 
clock cycles can overlap with subsequent floating point instructions.
(WAIT is considered a floating point instruction here)
                                       
Opcode               Operand     Clock cycles    Pairability    i-ov   fp-ov
-----------------------------------------------------------------------------
FLD                  r/m32/m64         1         +              0      0
FLD                  m80               3         np             0      0
FBLD                 m80              49         np             0      0
FST(P)               r                 1         np             0      0
FST(P)               m32/m64           2 h)      np             0      0
FST(P)               m80               3 h)      np             0      0
FBSTP                m80             153         np             0      0
FILD                 m                 3         np             2      2
FIST(P)              m                 6         np             0      0
FLDZ FLD1                              2         np             0      0
FLDPI FLDL2E etc.                      5         np             0      0
FNSTSW               AX/m16            6         np             0      0
FLDCW                m16               8         np             0      0
FNSTCW               m16               2         np             0      0

FADD(P)              r/m               3         +              2      2
FSUB(R)(P)           r/m               3         +              2      2
FMUL(P)              r/m               3         +              2      2 i)
FDIV(R)(P)           r/m        19/33/39 l)      + j)          38 k)   2
FCHS FABS                              1         +              0      0
FCOM(P)(P) FUCOM     r/m               1         +              0      0
FIADD FISUB(R)       m                 6         np             2      2
FIMUL                m                 6         np             2      2
FIDIV(R)             m          22/36/42 l)      np            38 k)   2
FICOM                m                 4         np             0      0
FTST                                   1         np             0      0
FXAM                                  17         np             4      0
FPREM                              16-64         np             2      2
FPREM1                             20-70         np             2      2
FRNDINT                               19         np             0      0
FSCALE                                32         np             5      0
FXTRACT                            12-66         np             0      0

FSQRT                                 70         np            69 k)   2
FSIN FCOS FSINCOS                 varies         np             2      2
F2XM1 FYL2X FYL2XP1               varies         np             2      2
FPATAN                            varies         np             2      2
FPTAN                             varies         np            36 k)   0

FNOP                                   2         np             0      0
FXCH                 r                 1         np             0      0
FINCSTP FDECSTP                        2         np             0      0
FFREE                r                 2         np             0      0
FNCLEX                                 6-9       np             0      0
FNINIT                                22         np             0      0
FNSAVE               m            ca.300         np             0      0
FRSTOR               m                73         np             0      0
WAIT                                   1         np             0      0
-----------------------------------------------------------------------------
Notes:
h) The value to store is needed one clock cycle in advance.
i) 1 if the overlapping instruction is also an FMUL.
j) If the FXCH is followed by an integer instruction then it will pair
   imperfectly and take an extra clock cycle so that the integer
   instruction will begin in clock cycle 3.
k) Cannot overlap integer multiplication instructions.
l) FDIV takes 19, 33, or 39 clock cycles for 24, 53, and 64 bit precision
   respectively. FIDIV takes 3 clocks more. The precision is defined by bit
   8-9 of the floating point control word.


22. TESTING SPEED
=================
The Pentium has an internal 64 bit clock counter which can be read into 
EDX:EAX using the instruction RDTSC (read time stamp counter). This is very
useful for testing exactly how many clock cycles a piece of code takes.

The program below is useful for measuring the number of clock cycles a piece
of code takes. The program executes the code to test 10 times and stores the
10 clock counts. The program can be used in both 16 and 32 bit mode.

RDTSC   MACRO                   ; define RDTSC instruction
        DB      0FH,31H
ENDM

ITER    EQU     10              ; number of iterations

.DATA                           ; data segment
ALIGN   4
COUNTER DD      0               ; loop counter
TICS    DD      0               ; temporary storage of clock
RESULTLIST  DD  ITER DUP (0)    ; list of test results

.CODE                           ; code segment
BEGIN:  MOV     [COUNTER],0     ; reset loop counter
TESTLOOP:                       ; test loop
;****************   Do any initializations here:    ************************
        FINIT
;****************   End of initializations          ************************
        RDTSC                   ; read clock counter
        MOV     [TICS],EAX      ; save count
        CLD                     ; non-pairable filler
REPT    8
        NOP                     ; eight NOP's to avoid shadowing effect
ENDM

;****************   Put instructions to test here:  ************************
        FLDPI                   ; this is only an example
        FSQRT
        RCR     EBX,10
        FSTP    ST
;********************* End of instructions to test  ************************

        CLC                     ; non-pairable filler with shadow
        RDTSC                   ; read counter again
        SUB     EAX,[TICS]      ; compute difference
        SUB     EAX,15          ; subtract the clock cycles used by fillers
        MOV     EDX,[COUNTER]   ; loop counter
        MOV     [RESULTLIST][EDX],EAX   ; store result in table
        ADD     EDX,TYPE RESULTLIST     ; increment counter
        MOV     [COUNTER],EDX           ; store counter
        CMP     EDX,ITER * (TYPE RESULTLIST)
        JB      TESTLOOP                ; repeat ITER times

; insert here code to read out the values in RESULTLIST

The 'filler' instructions before and after the piece of code to test are 
critical. The CLD is a non-pairable instruction which has been inserted to 
make sure the pairing is the same the first time as the subsequent times. 
The eight NOP instructions are inserted to prevent any prefixes in the code 
to test from being decoded in the shadow of the preceding instructions. Single 
byte instructions are used here to obtain the same pairing the first time as 
the subsequent times. The CLC after the code to test is a non-pairable
instruction which has a shadow under which the 0FH prefix of the RDTSC can 
be decoded so that it is independent of any shadowing effect from the code 
to test.

On the PentiumPro you have to put in a serializing instruction like CPUID 
before and after each RDTSC to prevent it from executing in parallel with 
anything else. CPUID has a decoding delay on the Pentium without MMX because 
of the 0FH prefix, but it has no shadow because it is serializing.

The RDTSC instruction cannot execute in virtual 8086 mode, so if you are 
running under DOS you must skip EMM386 (or any other memory manager) in your
CONFIG.SYS and not run in a DOS box in Windows.

The Pentium processor has special performance monitor counters which can 
count events such as cache misses, misalignments, AGI stalls, etc. Details 
about how to use the performance monitor counters are not covered by this 
manual and must be sought elsewhere.  


23. CONSIDERATIONS FOR OTHER MICROPROCESSORS
============================================
Most of the optimations described in this document have little or no 
negative effects on other microprocessors, including non-Intel processors, 
but there are some problems to be aware of.  

Using a full register after writing to part of the register will cause a 
moderate delay on the 80486 and a severe delay on the PentiumPro. Example:
MOV AL,[EBX] / MOV ECX,EAX
On the PentiumPro you may avoid this penalty by zeroing the full register 
first:
XOR EAX,EAX / MOV AL,[EBX] / MOV ECX,EAX
or by using MOVZX EAX,BYTE PTR [EBX]

Scheduling floating point code for the Pentium often requires a lot of extra
FXCH instructions. This will slow down execution on earlier microprocessors,
but not on the PentiumPro and advanced non-Intel processors.

As mentioned in the introduction, Intel has announced new MMX versions of
the Pentium and PentiumPro chips with special instructions for integer
vector operations. These instructions will be very useful for massively
parallel integer calculations.

The PentiumPro chip is faster than the Pentium in some respects, but 
inferior in other respects. Knowing the strong and weak sides of the 
PentiumPro can help you make your code work well on both processors.  

The most important advantage of the PentiumPro is that it does much of the 
optimation for you: reordering instructions and splitting complex 
instructions into simple ones. But for perfectly optimized code there is 
less difference between the two processors.  

The two processors have basically the same number of execution units, so the 
throughput should be near the same. The PPro has separate units for memory 
read and write so that it can do three operations simultaneously if one of 
them is a memory read, but on the other hand it cannot do two memory reads 
or two writes simultaneously as the Pentium can.  

The PPro is better than the Pentium in the following respects:

- out of order execution

- one cache miss does not delay subsequent independent instructions

- splitting complex instructions into smaller micro-ops

- automatic register renaming to avoid unnecessary dependencies

- better jump prediction algorithm than Pentium without MMX

- many instructions which are unpairable and poorly optimized on the Pentium
  perform better on the PPro, e.g. integer multiplication, movzx, cdq, bit
  scan, bit test, shifts by cl, and floating point store

- floating point instructions and simple integer instructions can execute
  simultaneously

- memory reads and writes do not occupy the ALU's

- indirect memory read instructions have no AGI stall

- new conditional move instructions can be used instead of branches in some 
  cases

- new FCOMI instruction eliminates the need for the slow FNSTSW AX 

- higher maximum clock frequency

The PPro is inferior to the Pentium in the following respects:

- mispredicted jumps are very expensive (10-15 clock cycles!)

- poor performance on 16 bit code and segmented models

- prefixes are expensive (except 0FH extended opcode)

- long stall when mixing 8, 16, and 32 bit registers

- fadd, fsub, fmul, fchs have longer latency

- cannot do two memory reads or two memory writes simultaneously

- some instruction combinations cannot execute in parallel, like
  push+push, push+call, compare+conditional jump

As a consequence of this, the PentiumPro may actually be slower than the 
Pentium on perfectly optimized code that contains many unpredictable branches, 
or much floating point code with little or no natural parallelism.  

Most of the drawbacks of each processor can be circumvented by careful 
optimation and by running in 32 bit flat mode. But the problem with mispredicted 
jumps on the PPro cannot be avoided except in the cases where you can use a 
conditional move instead.  

Taking advantage of the new instructions in the MMX and PentiumPro 
processors will create problems if you want your code to be compatible with 
earlier microprocessors. The solution may be to write several versions of 
your code, each optimized for a particular processor. Your program should 
automatically detect which processor it is running on and select the 
appropriate version of code. Such a complicated approach is of course only 
needed for the most critical parts of your program.